00_all

Loading libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(patchwork)
library(ggplot2)
library(knitr)
library("quarto")
library(here)
here() starts at /home/people/s243903/projects/project_r4bd
library(broom)
library(cowplot)

Attaching package: 'cowplot'

The following object is masked from 'package:patchwork':

    align_plots

The following object is masked from 'package:lubridate':

    stamp
library(ggrepel)
library(readr)
library(stringr)

Create results folder if non exisisting

# Create results folder if non-existing
if (!dir.exists("../results")) {
  dir.create("../results")
}

We will use quarto::render

Loading data

# Render the other .qmd files
quarto::quarto_render("../R/01_load.qmd")


processing file: 01_load.qmd

  |                                                          
  |                                                    |   0%
  |                                                          
  |..........                                          |  20%                  
  |                                                          
  |.....................                               |  40% [unnamed-chunk-1]
  |                                                          
  |...............................                     |  60%                  
  |                                                          
  |..........................................          |  80% [unnamed-chunk-2]trying URL 'https://archive.ics.uci.edu/static/public/383/cervical+cancer+risk+factors.zip'
downloaded 99 KB


  |                                                          
  |....................................................| 100%                  
                                                                                                            
output file: 01_load.knit.md

pandoc 
  to: html
  output-file: 01_load.html
  standalone: true
  section-divs: true
  html-math-method: mathjax
  wrap: none
  default-image-extension: png
  
metadata
  document-css: false
  link-citations: true
  date-format: long
  lang: en
  title: 01_load
  editor: visual
  
Output created: 01_load.html
#move to results
file.rename(here("R/01_load.html"), here("results/01_load.html"))
[1] TRUE

clean data

quarto::quarto_render("../R/02_clean.qmd")


processing file: 02_clean.qmd

  |                                                          
  |                                                    |   0%
  |                                                          
  |...                                                 |   7%                  
  |                                                          
  |.......                                             |  13% [unnamed-chunk-1]
  |                                                          
  |..........                                          |  20%                  
  |                                                          
  |..............                                      |  27% [unnamed-chunk-2]
  |                                                          
  |.................                                   |  33%                  
  |                                                          
  |.....................                               |  40% [unnamed-chunk-3]
  |                                                          
  |........................                            |  47%                  
  |                                                          
  |............................                        |  53% [unnamed-chunk-4]
  |                                                          
  |...............................                     |  60%                  
  |                                                          
  |...................................                 |  67% [unnamed-chunk-5]
  |                                                          
  |......................................              |  73%                  
  |                                                          
  |..........................................          |  80% [unnamed-chunk-6]
  |                                                          
  |.............................................       |  87%                  
  |                                                          
  |.................................................   |  93% [unnamed-chunk-7]
  |                                                          
  |....................................................| 100%                  
                                                                                                            
output file: 02_clean.knit.md

pandoc 
  to: html
  output-file: 02_clean.html
  standalone: true
  self-contained: true
  section-divs: true
  html-math-method: mathjax
  wrap: none
  default-image-extension: png
  
metadata
  document-css: false
  link-citations: true
  date-format: long
  lang: en
  title: 02_clean
  editor: visual
  
Output created: 02_clean.html
#move to results:
file.rename(here("R/02_clean.html"), here("results/02_clean.html"))
[1] TRUE

Augment the data

quarto::quarto_render("../R/03_augment.qmd")


processing file: 03_augment.qmd

  |                                                         
  |                                                   |   0%
  |                                                         
  |..                                                 |   5%                   
  |                                                         
  |.....                                              |  10% [unnamed-chunk-1] 
  |                                                         
  |.......                                            |  14%                   
  |                                                         
  |..........                                         |  19% [unnamed-chunk-2] 
  |                                                         
  |............                                       |  24%                   
  |                                                         
  |...............                                    |  29% [unnamed-chunk-3] 
  |                                                         
  |.................                                  |  33%                   
  |                                                         
  |...................                                |  38% [unnamed-chunk-4] 
  |                                                         
  |......................                             |  43%                   
  |                                                         
  |........................                           |  48% [unnamed-chunk-5] 
  |                                                         
  |...........................                        |  52%                   
  |                                                         
  |.............................                      |  57% [unnamed-chunk-6] 
  |                                                         
  |................................                   |  62%                   
  |                                                         
  |..................................                 |  67% [unnamed-chunk-7] 
  |                                                         
  |....................................               |  71%                   
  |                                                         
  |.......................................            |  76% [unnamed-chunk-8] 
  |                                                         
  |.........................................          |  81%                   
  |                                                         
  |............................................       |  86% [unnamed-chunk-9] 
  |                                                         
  |..............................................     |  90%                   
  |                                                         
  |.................................................  |  95% [unnamed-chunk-10]
  |                                                         
  |...................................................| 100%                   
                                                                                                             
output file: 03_augment.knit.md

pandoc 
  to: html
  output-file: 03_augment.html
  standalone: true
  self-contained: true
  section-divs: true
  html-math-method: mathjax
  wrap: none
  default-image-extension: png
  
metadata
  document-css: false
  link-citations: true
  date-format: long
  lang: en
  title: 03_augment
  editor: visual
  
Output created: 03_augment.html
#move to results:
file.rename(here("R/03_augment.html"), here("results/03_augment.html"))
[1] TRUE

Describe the data

quarto::quarto_render("../R/04_describe.qmd")


processing file: 04_describe.qmd

  |                                                         
  |                                                   |   0%
  |                                                         
  |..                                                 |   4%                   
  |                                                         
  |....                                               |   9% [unnamed-chunk-1] 
  |                                                         
  |.......                                            |  13%                   
  |                                                         
  |.........                                          |  17% [unnamed-chunk-2] 
  |                                                         
  |...........                                        |  22%                   
  |                                                         
  |.............                                      |  26% [unnamed-chunk-3] 
  |                                                         
  |................                                   |  30%                   
  |                                                         
  |..................                                 |  35% [unnamed-chunk-4] 
  |                                                         
  |....................                               |  39%                   
  |                                                         
  |......................                             |  43% [unnamed-chunk-5] 
  |                                                         
  |........................                           |  48%                   
  |                                                         
  |...........................                        |  52% [unnamed-chunk-6] 
  |                                                         
  |.............................                      |  57%                   
  |                                                         
  |...............................                    |  61% [unnamed-chunk-7] 
  |                                                         
  |.................................                  |  65%                   
  |                                                         
  |...................................                |  70% [unnamed-chunk-8] 
  |                                                         
  |......................................             |  74%                   
  |                                                         
  |........................................           |  78% [unnamed-chunk-9] 
  |                                                         
  |..........................................         |  83%                   
  |                                                         
  |............................................       |  87% [unnamed-chunk-10]
  |                                                         
  |...............................................    |  91%                   
  |                                                         
  |.................................................  |  96% [unnamed-chunk-11]
  |                                                         
  |...................................................| 100%                   
                                                                                                             
output file: 04_describe.knit.md

pandoc 
  to: html
  output-file: 04_describe.html
  standalone: true
  self-contained: true
  section-divs: true
  html-math-method: mathjax
  wrap: none
  default-image-extension: png
  
metadata
  document-css: false
  link-citations: true
  date-format: long
  lang: en
  title: 04_describe
  editor: visual
  
Output created: 04_describe.html
#mnove to results
file.rename(here("R/04_describe.html"), here("results/04_describe.html"))
[1] TRUE

Analysis 1

quarto::quarto_render("../R/05_analysis_1.qmd")


processing file: 05_analysis_1.qmd

  |                                                          
  |                                                    |   0%
  |                                                          
  |...                                                 |   7%                  
  |                                                          
  |.......                                             |  13% [unnamed-chunk-1]
  |                                                          
  |..........                                          |  20%                  
  |                                                          
  |..............                                      |  27% [unnamed-chunk-2]
  |                                                          
  |.................                                   |  33%                  
  |                                                          
  |.....................                               |  40% [unnamed-chunk-3]
  |                                                          
  |........................                            |  47%                  
  |                                                          
  |............................                        |  53% [unnamed-chunk-4]
  |                                                          
  |...............................                     |  60%                  
  |                                                          
  |...................................                 |  67% [unnamed-chunk-5]
  |                                                          
  |......................................              |  73%                  
  |                                                          
  |..........................................          |  80% [unnamed-chunk-6]
  |                                                          
  |.............................................       |  87%                  
  |                                                          
  |.................................................   |  93% [unnamed-chunk-7]
  |                                                          
  |....................................................| 100%                  
                                                                                                            
output file: 05_analysis_1.knit.md

pandoc 
  to: html
  output-file: 05_analysis_1.html
  standalone: true
  self-contained: true
  section-divs: true
  html-math-method: mathjax
  wrap: none
  default-image-extension: png
  
metadata
  document-css: false
  link-citations: true
  date-format: long
  lang: en
  title: 05_analysis_1
  editor: visual
  
Output created: 05_analysis_1.html
#move to results:
file.rename(here("R/05_analysis_1.html"), here("results/05_analysis_1.html"))
[1] TRUE

Analysis 2 PCA

quarto::quarto_render("../R/06_analysis_2_PCA.qmd")


processing file: 06_analysis_2_PCA.qmd

  |                                                          
  |                                                    |   0%
  |                                                          
  |.....                                               |   9%                  
  |                                                          
  |.........                                           |  18% [unnamed-chunk-1]
  |                                                          
  |..............                                      |  27%                  
  |                                                          
  |...................                                 |  36% [unnamed-chunk-2]
  |                                                          
  |........................                            |  45%                  
  |                                                          
  |............................                        |  55% [unnamed-chunk-3]
  |                                                          
  |.................................                   |  64%                  
  |                                                          
  |......................................              |  73% [unnamed-chunk-4]
  |                                                          
  |...........................................         |  82%                  
  |                                                          
  |...............................................     |  91% [unnamed-chunk-5]
  |                                                          
  |....................................................| 100%                  
                                                                                                            
output file: 06_analysis_2_PCA.knit.md

pandoc 
  to: html
  output-file: 06_analysis_2_PCA.html
  standalone: true
  self-contained: true
  section-divs: true
  html-math-method: mathjax
  wrap: none
  default-image-extension: png
  
metadata
  document-css: false
  link-citations: true
  date-format: long
  lang: en
  title: PCA
  editor: visual
  
Output created: 06_analysis_2_PCA.html
file.rename(here("R/06_analysis_2_PCA.html"), here("results/06_analysis_2_PCA.html"))
[1] TRUE

Finally also render this document to the results:

files <- c("../R/01_load.qmd", "../R/02_clean.qmd", "../R/03_augment.qmd",
           "../R/04_describe.qmd", "../R/05_analysis_1.qmd", "../R/06_analysis_2_PCA.qmd")
lapply(files, quarto::quarto_render)


processing file: 01_load.qmd

  |                                                          
  |                                                    |   0%
  |                                                          
  |..........                                          |  20%                  
  |                                                          
  |.....................                               |  40% [unnamed-chunk-1]
  |                                                          
  |...............................                     |  60%                  
  |                                                          
  |..........................................          |  80% [unnamed-chunk-2]trying URL 'https://archive.ics.uci.edu/static/public/383/cervical+cancer+risk+factors.zip'
downloaded 99 KB


  |                                                          
  |....................................................| 100%                  
                                                                                                            
output file: 01_load.knit.md

pandoc 
  to: html
  output-file: 01_load.html
  standalone: true
  section-divs: true
  html-math-method: mathjax
  wrap: none
  default-image-extension: png
  
metadata
  document-css: false
  link-citations: true
  date-format: long
  lang: en
  title: 01_load
  editor: visual
  
Output created: 01_load.html



processing file: 02_clean.qmd

  |                                                          
  |                                                    |   0%
  |                                                          
  |...                                                 |   7%                  
  |                                                          
  |.......                                             |  13% [unnamed-chunk-1]
  |                                                          
  |..........                                          |  20%                  
  |                                                          
  |..............                                      |  27% [unnamed-chunk-2]
  |                                                          
  |.................                                   |  33%                  
  |                                                          
  |.....................                               |  40% [unnamed-chunk-3]
  |                                                          
  |........................                            |  47%                  
  |                                                          
  |............................                        |  53% [unnamed-chunk-4]
  |                                                          
  |...............................                     |  60%                  
  |                                                          
  |...................................                 |  67% [unnamed-chunk-5]
  |                                                          
  |......................................              |  73%                  
  |                                                          
  |..........................................          |  80% [unnamed-chunk-6]
  |                                                          
  |.............................................       |  87%                  
  |                                                          
  |.................................................   |  93% [unnamed-chunk-7]
  |                                                          
  |....................................................| 100%                  
                                                                                                            
output file: 02_clean.knit.md

pandoc 
  to: html
  output-file: 02_clean.html
  standalone: true
  self-contained: true
  section-divs: true
  html-math-method: mathjax
  wrap: none
  default-image-extension: png
  
metadata
  document-css: false
  link-citations: true
  date-format: long
  lang: en
  title: 02_clean
  editor: visual
  
Output created: 02_clean.html



processing file: 03_augment.qmd

  |                                                         
  |                                                   |   0%
  |                                                         
  |..                                                 |   5%                   
  |                                                         
  |.....                                              |  10% [unnamed-chunk-1] 
  |                                                         
  |.......                                            |  14%                   
  |                                                         
  |..........                                         |  19% [unnamed-chunk-2] 
  |                                                         
  |............                                       |  24%                   
  |                                                         
  |...............                                    |  29% [unnamed-chunk-3] 
  |                                                         
  |.................                                  |  33%                   
  |                                                         
  |...................                                |  38% [unnamed-chunk-4] 
  |                                                         
  |......................                             |  43%                   
  |                                                         
  |........................                           |  48% [unnamed-chunk-5] 
  |                                                         
  |...........................                        |  52%                   
  |                                                         
  |.............................                      |  57% [unnamed-chunk-6] 
  |                                                         
  |................................                   |  62%                   
  |                                                         
  |..................................                 |  67% [unnamed-chunk-7] 
  |                                                         
  |....................................               |  71%                   
  |                                                         
  |.......................................            |  76% [unnamed-chunk-8] 
  |                                                         
  |.........................................          |  81%                   
  |                                                         
  |............................................       |  86% [unnamed-chunk-9] 
  |                                                         
  |..............................................     |  90%                   
  |                                                         
  |.................................................  |  95% [unnamed-chunk-10]
  |                                                         
  |...................................................| 100%                   
                                                                                                             
output file: 03_augment.knit.md

pandoc 
  to: html
  output-file: 03_augment.html
  standalone: true
  self-contained: true
  section-divs: true
  html-math-method: mathjax
  wrap: none
  default-image-extension: png
  
metadata
  document-css: false
  link-citations: true
  date-format: long
  lang: en
  title: 03_augment
  editor: visual
  
Output created: 03_augment.html



processing file: 04_describe.qmd

  |                                                         
  |                                                   |   0%
  |                                                         
  |..                                                 |   4%                   
  |                                                         
  |....                                               |   9% [unnamed-chunk-1] 
  |                                                         
  |.......                                            |  13%                   
  |                                                         
  |.........                                          |  17% [unnamed-chunk-2] 
  |                                                         
  |...........                                        |  22%                   
  |                                                         
  |.............                                      |  26% [unnamed-chunk-3] 
  |                                                         
  |................                                   |  30%                   
  |                                                         
  |..................                                 |  35% [unnamed-chunk-4] 
  |                                                         
  |....................                               |  39%                   
  |                                                         
  |......................                             |  43% [unnamed-chunk-5] 
  |                                                         
  |........................                           |  48%                   
  |                                                         
  |...........................                        |  52% [unnamed-chunk-6] 
  |                                                         
  |.............................                      |  57%                   
  |                                                         
  |...............................                    |  61% [unnamed-chunk-7] 
  |                                                         
  |.................................                  |  65%                   
  |                                                         
  |...................................                |  70% [unnamed-chunk-8] 
  |                                                         
  |......................................             |  74%                   
  |                                                         
  |........................................           |  78% [unnamed-chunk-9] 
  |                                                         
  |..........................................         |  83%                   
  |                                                         
  |............................................       |  87% [unnamed-chunk-10]
  |                                                         
  |...............................................    |  91%                   
  |                                                         
  |.................................................  |  96% [unnamed-chunk-11]
  |                                                         
  |...................................................| 100%                   
                                                                                                             
output file: 04_describe.knit.md

pandoc 
  to: html
  output-file: 04_describe.html
  standalone: true
  self-contained: true
  section-divs: true
  html-math-method: mathjax
  wrap: none
  default-image-extension: png
  
metadata
  document-css: false
  link-citations: true
  date-format: long
  lang: en
  title: 04_describe
  editor: visual
  
Output created: 04_describe.html



processing file: 05_analysis_1.qmd

  |                                                          
  |                                                    |   0%
  |                                                          
  |...                                                 |   7%                  
  |                                                          
  |.......                                             |  13% [unnamed-chunk-1]
  |                                                          
  |..........                                          |  20%                  
  |                                                          
  |..............                                      |  27% [unnamed-chunk-2]
  |                                                          
  |.................                                   |  33%                  
  |                                                          
  |.....................                               |  40% [unnamed-chunk-3]
  |                                                          
  |........................                            |  47%                  
  |                                                          
  |............................                        |  53% [unnamed-chunk-4]
  |                                                          
  |...............................                     |  60%                  
  |                                                          
  |...................................                 |  67% [unnamed-chunk-5]
  |                                                          
  |......................................              |  73%                  
  |                                                          
  |..........................................          |  80% [unnamed-chunk-6]
  |                                                          
  |.............................................       |  87%                  
  |                                                          
  |.................................................   |  93% [unnamed-chunk-7]
  |                                                          
  |....................................................| 100%                  
                                                                                                            
output file: 05_analysis_1.knit.md

pandoc 
  to: html
  output-file: 05_analysis_1.html
  standalone: true
  self-contained: true
  section-divs: true
  html-math-method: mathjax
  wrap: none
  default-image-extension: png
  
metadata
  document-css: false
  link-citations: true
  date-format: long
  lang: en
  title: 05_analysis_1
  editor: visual
  
Output created: 05_analysis_1.html



processing file: 06_analysis_2_PCA.qmd

  |                                                          
  |                                                    |   0%
  |                                                          
  |.....                                               |   9%                  
  |                                                          
  |.........                                           |  18% [unnamed-chunk-1]
  |                                                          
  |..............                                      |  27%                  
  |                                                          
  |...................                                 |  36% [unnamed-chunk-2]
  |                                                          
  |........................                            |  45%                  
  |                                                          
  |............................                        |  55% [unnamed-chunk-3]
  |                                                          
  |.................................                   |  64%                  
  |                                                          
  |......................................              |  73% [unnamed-chunk-4]
  |                                                          
  |...........................................         |  82%                  
  |                                                          
  |...............................................     |  91% [unnamed-chunk-5]
  |                                                          
  |....................................................| 100%                  
                                                                                                            
output file: 06_analysis_2_PCA.knit.md

pandoc 
  to: html
  output-file: 06_analysis_2_PCA.html
  standalone: true
  self-contained: true
  section-divs: true
  html-math-method: mathjax
  wrap: none
  default-image-extension: png
  
metadata
  document-css: false
  link-citations: true
  date-format: long
  lang: en
  title: PCA
  editor: visual
  
Output created: 06_analysis_2_PCA.html
[[1]]
NULL

[[2]]
NULL

[[3]]
NULL

[[4]]
NULL

[[5]]
NULL

[[6]]
NULL
# Move the generated HTML file to the 'results' folder
file.rename(here("R/00_all.html"), here("results/00_all.html"))
Warning in file.rename(here("R/00_all.html"), here("results/00_all.html")):
cannot rename file '/home/people/s243903/projects/project_r4bd/R/00_all.html'
to '/home/people/s243903/projects/project_r4bd/results/00_all.html', reason 'No
such file or directory'
[1] FALSE
01_load

01_load

Loading packages

library(readr)
library(stringr)
library(here)
here() starts at /home/people/s243903/projects/project_r4bd

Loading data

# Base URL and paths
base_url <- "https://archive.ics.uci.edu/static/public/383/cervical+cancer+risk+factors.zip"
raw_data_path <- here("data/_raw/")
processed_data_path <- here("data/")
zip_file <- file.path(raw_data_path, "cervical_cancer_risk_factors.zip")

# Check and create the raw and processed data directories
dir.create(raw_data_path, recursive = TRUE, showWarnings = FALSE)
dir.create(processed_data_path, recursive = TRUE, showWarnings = FALSE)

# Debugging: Confirm directories
message("Raw data path: ", raw_data_path)
Raw data path: /home/people/s243903/projects/project_r4bd/data/_raw/
message("Processed data path: ", processed_data_path)
Processed data path: /home/people/s243903/projects/project_r4bd/data/
# Download ZIP file and extract
download.file(url = base_url, destfile = zip_file, mode = "wb")
if (!file.exists(zip_file)) stop("Failed to download ZIP file: ", zip_file)
unzip(zip_file, exdir = raw_data_path)

# Debugging: Check extracted contents
extracted_files <- list.files(raw_data_path, full.names = TRUE)
message("Extracted files: ", paste(extracted_files, collapse = ", "))
Extracted files: /home/people/s243903/projects/project_r4bd/data/_raw//cervical_cancer_risk_factors.zip, /home/people/s243903/projects/project_r4bd/data/_raw//risk_factors_cervical_cancer.csv
# Identify the CSV file (adjust if the file name differs)
csv_file <- file.path(raw_data_path, "risk_factors_cervical_cancer.csv")

# Check if the CSV file exists and read it
if (file.exists(csv_file)) {
  data_raw <- read.csv(file = csv_file)
  message("Data successfully loaded.")
} else {
  stop("CSV file not found after extraction. Check the ZIP contents.")
}
Data successfully loaded.
# Save the loaded data as 01_data_load.csv in the processed data folder
output_file <- file.path(processed_data_path, "01_data_load.csv")
write_csv(data_raw, output_file)

# Debugging: Confirm the file is saved
if (file.exists(output_file)) {
  message("Processed data saved at: ", output_file)
} else {
  stop("Failed to save the processed data.")
}
Processed data saved at: /home/people/s243903/projects/project_r4bd/data//01_data_load.csv
02_clean

02_clean

Loaing libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading data to clean

data_raw <- read.csv("../data/01_data_load.csv")

Cleaning

Correcting NAs

data_raw <- data_raw |>
  mutate(across(where(is.character), ~na_if(., "?")))

Changing binaries to ‘yes’ and ‘no’

data_raw <- data_raw |>
  mutate(
    Smokes = case_when(
      Smokes == "0.0" ~ "no",
      Smokes == "1.0" ~ "yes"
    ),
    Hormonal.Contraceptives = case_when(
      Hormonal.Contraceptives == "1.0" ~ "yes",
      Hormonal.Contraceptives == "0.0" ~ "no"
    ),
    IUD = case_when(
      IUD == "0.0" ~ "no",
      IUD == "1.0" ~ "yes"
    ),
    Dx.Cancer = case_when(
      Dx.Cancer == "0" ~ "no",
      Dx.Cancer == "1" ~ "yes"
    )
  )

Changing from character to integer

data_raw  <- data_raw |> 
  mutate(Number.of.sexual.partners = as.integer(Number.of.sexual.partners),
         First.sexual.intercourse = as.integer(First.sexual.intercourse),
         Num.of.pregnancies = as.integer(Num.of.pregnancies),
         Smokes..years. = as.numeric(Smokes..years.),
         Smokes..packs.year. = as.numeric(Smokes..packs.year.))

Updating names, removing dots

data_clean <- data_raw |> 
  rename('Smokes.years'= Smokes..years.,
         'Smokes.packs.years' = Smokes..packs.year.) |> 
  rename_with(~ gsub("^STDs\\.\\.", "", .)) |>
  rename_with(~ str_remove(.,"\\.$")) |> #removes the '.' from the last word in columns
  rename_with(~ str_replace_all(., "\\.", "_")) |> 
  rename_with(~ str_replace_all(.x, "__", "_"))

Exporting data

write_csv(x = data_clean,
          file = "../data/02_data_clean.csv")
03_augment

03_augment

Libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading data

data_clean <- read.csv("../data/02_data_clean.csv")

Data augmenting

Creating a Patient ID

data_clean <- data_clean |> 
  mutate(ID = row_number()) |> 
  select(ID, everything())

Creating long data

data_long <- data_clean |>
  pivot_longer(cols = starts_with("STDs_"), 
               names_to = "STD_type", 
               values_to = "has_STD") |>
  mutate(
    STD = ifelse(has_STD == 1, STD_type, NA)  # Keep STD name where 1 is present
  ) |>
  group_by(ID) |>
  mutate(
    # Concatenate STD names for each ID, if none, set "No"
    STD = ifelse(all(is.na(STD)), "No", paste(na.omit(STD), collapse = ", "))
  ) |>
  ungroup() |>
  select(-STD_type, -has_STD) |>
  distinct() |> 
  separate_rows(STD, sep = ",")

Remove STDs from the beginning of each row in column STD. Capitalize each word and remove the underscores.

data_long <- data_long |> 
  mutate(STD = str_remove(STD, "^STDs\\_"),
         STD = str_remove(STD, "^ STDs\\_"),
         STD = str_replace_all(STD, "\\_", " "),
         STD = str_to_title(STD))

Remove column STDs (We have the new STD column).

data_long <- data_long |> 
  select(-STDs, -number) 

Creating new number column that counts number of the same IDs.

data_long <- data_long |> 
  group_by(ID) |> 
  mutate(Number_of_STDs = if_else(all(is.na(STD) | STD == "No"), 0, n_distinct(STD, na.rm = TRUE))) |> 
  ungroup() 

Moving STD and Number of STD columns further ahead.

data_long <- data_long |> 
  select(
    1:12,               
    STD,                
    Number_of_STDs,      
    everything()        
  )

Creating new column for PCA analysis

data_aug <- data_long |>
  mutate(Cancer = if_else(Dx_Cancer == "yes", 1, 0))

Exporting data

write_csv(x = data_aug,
          file = "../data/03_data_aug.csv")
04_describe

04_describe

Loading libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading data

data_aug <- read_csv("../data/03_data_aug.csv")
Rows: 912 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): Smokes, Hormonal_Contraceptives, IUD, STD, Dx_Cancer
dbl (21): ID, Age, Number_of_sexual_partners, First_sexual_intercourse, Num_...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Descriptive statistics

Number of subjects

nr_of_subjects <- data_aug %>%
  summarise(distinct_ID_count = n_distinct(ID))

nr_of_subjects
# A tibble: 1 × 1
  distinct_ID_count
              <int>
1               858

Dimensions

Rows is the number of data points for STDs, so if a subject has 0 or 1 STD they will have 1 row. If a subject has multiple STDs, they will have one row per STD diagnosis.

data_aug |> 
  group_by(ID) |> 
  dim()
[1] 912  26

Age distribution

data_aug |> 
  group_by(ID) |> 
  ggplot(aes(x = Age)) + 
  geom_boxplot()

Number of Cancer positives

data_aug |> 
  group_by(Dx_Cancer) |> 
  count()
# A tibble: 2 × 2
# Groups:   Dx_Cancer [2]
  Dx_Cancer     n
  <chr>     <int>
1 no          894
2 yes          18

Smoking

Number of smokers

data_aug |> 
  group_by(Smokes) |> 
  count()
# A tibble: 3 × 2
# Groups:   Smokes [3]
  Smokes     n
  <chr>  <int>
1 no       759
2 yes      137
3 <NA>      16

Distribution of years of smoking

data_aug |> 
  group_by(ID) |> 
  ggplot(aes(x = Smokes_years)) + 
  geom_boxplot()
Warning: Removed 16 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Sexual health

First sexual intercourse distribution

data_aug |> 
  group_by(ID) |> 
  ggplot(aes(x = First_sexual_intercourse)) + 
  geom_boxplot()
Warning: Removed 7 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Number of pregnancies

data_aug |> 
  group_by(ID) |> 
  ggplot(aes(x = Num_of_pregnancies)) + 
  geom_boxplot()
Warning: Removed 59 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Number of sexual partners

data_aug |> 
  group_by(ID) |> 
  ggplot(aes(x = Number_of_sexual_partners)) + 
  geom_boxplot()
Warning: Removed 26 rows containing non-finite outside the scale range
(`stat_boxplot()`).

05_analysis_1

05_analysis_1

Loading libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(patchwork)

Loading data

data_aug <- read_csv("../data/03_data_aug.csv")
Rows: 912 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): Smokes, Hormonal_Contraceptives, IUD, STD, Dx_Cancer
dbl (21): ID, Age, Number_of_sexual_partners, First_sexual_intercourse, Num_...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Plotting

Numb_sex <- ggplot(data = data_aug, aes(y = Dx_Cancer, x = Number_of_sexual_partners, fill = Dx_Cancer)) +
  geom_boxplot(show.legend = FALSE) +
   labs(y = "Cancer Diagnosis",
        x = "Number of Sexual Partners") +
  theme_bw()


first_sex_int <- ggplot(data = data_aug, aes(y = Dx_Cancer, x = First_sexual_intercourse, fill = Dx_Cancer)) +
  geom_boxplot(show.legend = FALSE) +
   labs(y = "Cancer Diagnosis",
        x = "Age of First Sexual Intercourse") +
  theme_bw()


boxplot_sex_his <- (Numb_sex / first_sex_int) +
  plot_annotation(
    title = "Comparison of Sexual History and Cancer Diagnosis",
    caption = "Source: https://doi.org/10.24432/C5Z310")

boxplot_sex_his
Warning: Removed 26 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Warning: Removed 7 rows containing non-finite outside the scale range
(`stat_boxplot()`).

#save plot
ggsave("../results/images/05_boxplot_sex_his.png", plot = boxplot_sex_his)
Saving 7 x 5 in image
Warning: Removed 26 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Removed 7 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Normalized Counts of STD by Cancer Diagnosis

data_normalized_STD <- data_aug |>
  group_by(Dx_Cancer, STD) |>
  summarize(count = n(), .groups = "drop") |>
  group_by(Dx_Cancer) |>
  mutate(prop = count / sum(count))


data_normalized_STD <- data_normalized_STD |>
  mutate(STD = fct_reorder(STD, count, .desc = TRUE))

barplot_STD <- ggplot(data = data_normalized_STD, aes(x = STD, y = prop, fill = Dx_Cancer)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_grid(~Dx_Cancer) +
  labs(title = "Proportion of STD type by Cancer Diagnosis",
       x = "STD",
       y = "Proportion",
       fill = "Cancer Diagnosis",
       caption = "Source: https://doi.org/10.24432/C5Z310") +
  theme_bw() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),  # Rotate text
    plot.margin = margin(10, 10, 10, 50)  # Increase the left margin
  )

barplot_STD

#save plot
ggsave("../results/images/05_barplot_STD.png", plot = barplot_STD)
Saving 7 x 5 in image

Age and Cancer diagnosis

ggplot(data = data_aug, aes(y = Dx_Cancer, x = Age, fill = Dx_Cancer)) +
  geom_boxplot(show.legend = FALSE) +
  labs(y = "Cancer Diagnosis") +
  theme_bw()

Function for making a barplot

create_proportional_barplot_1 <- function(data, x_var, fill_var, 
                                        x_label = NULL, y_label = "Proportion", 
                                        fill_label = NULL, title = NULL) {
  # Dynamically use variable names if no custom labels are provided
  if (is.null(x_label)) x_label <- x_var
  if (is.null(fill_label)) fill_label <- fill_var
  if (is.null(title)) title <- paste("Proportion of", fill_var, "by", x_var)
  
  # Summarize the data by grouping by the ID and selecting the first value for each variable
  data_summary <- data %>%
    group_by(ID) %>%
    summarise(
      x_value = first(.data[[x_var]]),  # Taking the first value for x_var per ID
      fill_value = first(.data[[fill_var]])   # Taking the first value for fill_var per ID
    )
  
  # Remove rows with NA values in either x_value or fill_value
  filtered_data <- data_summary[!is.na(data_summary$x_value) & !is.na(data_summary$fill_value), ]
  
  # Convert the fill variable to a factor
  filtered_data$fill_value <- as.factor(filtered_data$fill_value)
  
  # Generate a color palette dynamically
  colors <- setNames(c("lightblue", "darkred"), levels(filtered_data$fill_value))
  
  # Generate the plot
  ggplot(filtered_data, aes(x = .data$x_value, fill = .data$fill_value)) +
    geom_bar(position = "fill") +
    labs(x = x_label, y = y_label, fill = fill_label, title = title) +
    scale_fill_manual(values = colors) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 0, hjust = 0.5))
}

Function for making a chi squared test

calculate_chi_squared_1 <- function(data, var1, var2) {
  # Summarize the data so that each ID has one row
  data_summary <- data %>%
    group_by(ID) %>%
    summarise(
      var1_value = first(.data[[var1]]),  # Taking the first value for var1 per ID
      var2_value = first(.data[[var2]])   # Taking the first value for var2 per ID
    )
  
  # Remove rows with NA values in either of the variables
  filtered_data <- data_summary[!is.na(data_summary$var1_value) & !is.na(data_summary$var2_value), ]
  
  # Create the contingency table
  contingency_table <- table(filtered_data$var1_value, filtered_data$var2_value)
  
  # Perform the chi-squared test
  chisq_result <- chisq.test(contingency_table)
  
  return(list(contingency_table = contingency_table, chisq_result = chisq_result))
}
PCA

PCA

Libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
library(cowplot)

Attaching package: 'cowplot'

The following object is masked from 'package:lubridate':

    stamp
library(ggrepel)

Loading

data_aug <- read.csv("../data/03_data_aug.csv")

Making a PCA plot based on patients have cancer or not. Created a new column based on Dx_cancer

cancer_data <- data_aug |> 
  select(-Time_since_first_diagnosis, -Time_since_last_diagnosis, Dx_CIN) |> drop_na() 

pca_fit_cancer <- cancer_data |> 
  select(where(is.numeric)) |>  
  prcomp(scale = TRUE)

scatterplot_pca <- pca_fit_cancer |> 
  augment(cancer_data) |> 
  ggplot(aes(.fittedPC1, .fittedPC2, color = factor(Cancer))) + 
  geom_point(size = 1.5) +
  theme_half_open(12) + background_grid() + scale_color_discrete(labels = c("0" = "Negative", "1" = "Positive")) +
  labs(color = "Cervical Cancer Present") + ggtitle("PCA Plot") + theme(plot.title = element_text(hjust = 0.5))

summary(pca_fit_cancer)
Importance of components:
                          PC1    PC2    PC3     PC4     PC5     PC6     PC7
Standard deviation     1.7460 1.5531 1.4707 1.32729 1.19567 1.11335 1.04722
Proportion of Variance 0.1605 0.1270 0.1138 0.09272 0.07524 0.06524 0.05772
Cumulative Proportion  0.1605 0.2874 0.4012 0.49396 0.56921 0.63445 0.69217
                           PC8     PC9    PC10   PC11    PC12    PC13   PC14
Standard deviation     1.02522 0.97290 0.93986 0.9028 0.83156 0.65611 0.5195
Proportion of Variance 0.05532 0.04982 0.04649 0.0429 0.03639 0.02266 0.0142
Cumulative Proportion  0.74749 0.79730 0.84380 0.8867 0.92309 0.94575 0.9599
                          PC15   PC16   PC17    PC18    PC19
Standard deviation     0.50325 0.4530 0.4088 0.28497 0.23277
Proportion of Variance 0.01333 0.0108 0.0088 0.00427 0.00285
Cumulative Proportion  0.97328 0.9841 0.9929 0.99715 1.00000
scatterplot_pca

#save plot
ggsave("../results/images/06_scatterplot_pca.png", plot = scatterplot_pca, bg = "white")
Saving 7 x 5 in image

The negative patients form a dense cluster in the right corner, whereas it is more spread out to the lower-left corner for the positive ones. Patients with positive cancer diagnosis have more distinct patterns in their values compared to those with no cancer. This separation suggests that features e.g., age, smoking or STD’s might contribute strongly to distinguish cancer-positive patients.

The data in PC coordinates

pca_fit_cancer |> 
  tidy(matrix = "rotation")
# A tibble: 361 × 3
   column    PC   value
   <chr>  <dbl>   <dbl>
 1 ID         1  0.0110
 2 ID         2 -0.0333
 3 ID         3 -0.222 
 4 ID         4  0.235 
 5 ID         5 -0.167 
 6 ID         6 -0.0169
 7 ID         7 -0.218 
 8 ID         8  0.353 
 9 ID         9 -0.213 
10 ID        10  0.653 
# ℹ 351 more rows
# define arrow style for plotting
arrow_style <- arrow(
  angle = 20, ends = "first", type = "closed", length = grid::unit(10, "pt")
)


pc_coordinates <- pca_fit_cancer |>
  tidy(matrix = "rotation") %>%
  pivot_wider(names_from = "PC", names_prefix = "PC", values_from = "value") %>%
  ggplot(aes(PC1, PC2)) +
  geom_segment(xend = 0, yend = 0, arrow = arrow_style) +
  geom_text_repel(
    aes(label = column),
    color = "darkblue"
  ) +
  #xlim(-0.5, .3) + ylim(-.25, 0.5) +
  coord_fixed() +
  theme_minimal_grid(12) +
  ggtitle("The data in the PC coordinates") + 
  theme(plot.title = element_text(hjust = 0.5))


#save plot
ggsave("../results/images/06_pc_coordinates.png", plot = pc_coordinates, bg = "white")
Saving 7 x 5 in image

The arrows shows how the features contribute to the principal components analysis.

The variance explained by each PC

A general rule is to keep the components that explains up to 95% of the variance.

pca_fit_cancer |>
  tidy(matrix = "eigenvalues")
# A tibble: 19 × 4
      PC std.dev percent cumulative
   <dbl>   <dbl>   <dbl>      <dbl>
 1     1   1.75  0.160        0.160
 2     2   1.55  0.127        0.287
 3     3   1.47  0.114        0.401
 4     4   1.33  0.0927       0.494
 5     5   1.20  0.0752       0.569
 6     6   1.11  0.0652       0.634
 7     7   1.05  0.0577       0.692
 8     8   1.03  0.0553       0.747
 9     9   0.973 0.0498       0.797
10    10   0.940 0.0465       0.844
11    11   0.903 0.0429       0.887
12    12   0.832 0.0364       0.923
13    13   0.656 0.0227       0.946
14    14   0.519 0.0142       0.960
15    15   0.503 0.0133       0.973
16    16   0.453 0.0108       0.984
17    17   0.409 0.0088       0.993
18    18   0.285 0.00427      0.997
19    19   0.233 0.00285      1    
pc_variance <- pca_fit_cancer |>
  tidy(matrix = "eigenvalues") |>
  ggplot(aes(PC, percent)) +
  geom_col(fill = "darkblue", alpha = 0.8) +
  scale_x_continuous(breaks = 1:17) +
  scale_y_continuous(
    labels = scales::percent_format(),
    expand = expansion(mult = c(0, 0.01))
  ) +
  theme_minimal_hgrid(12) +
  ggtitle("The variance explained by each PC") + theme(plot.title = element_text(hjust = 0.5))

#save plot
ggsave("../results/images/06_pc_variance.png", plot = pc_variance, bg = "white")
Saving 7 x 5 in image

After PC4 and PC% the variance explained by each PC drops significantly forming an ‘elbow’. After PC10 the other components explain very little variance. These components might not be useful for analyzing or modeling.

01_load

01_load

Loading packages

library(readr)
library(stringr)
library(here)
here() starts at /home/people/s243903/projects/project_r4bd

Loading data

# Base URL and paths
base_url <- "https://archive.ics.uci.edu/static/public/383/cervical+cancer+risk+factors.zip"
raw_data_path <- here("data/_raw/")
processed_data_path <- here("data/")
zip_file <- file.path(raw_data_path, "cervical_cancer_risk_factors.zip")

# Check and create the raw and processed data directories
dir.create(raw_data_path, recursive = TRUE, showWarnings = FALSE)
dir.create(processed_data_path, recursive = TRUE, showWarnings = FALSE)

# Debugging: Confirm directories
message("Raw data path: ", raw_data_path)
Raw data path: /home/people/s243903/projects/project_r4bd/data/_raw/
message("Processed data path: ", processed_data_path)
Processed data path: /home/people/s243903/projects/project_r4bd/data/
# Download ZIP file and extract
download.file(url = base_url, destfile = zip_file, mode = "wb")
if (!file.exists(zip_file)) stop("Failed to download ZIP file: ", zip_file)
unzip(zip_file, exdir = raw_data_path)

# Debugging: Check extracted contents
extracted_files <- list.files(raw_data_path, full.names = TRUE)
message("Extracted files: ", paste(extracted_files, collapse = ", "))
Extracted files: /home/people/s243903/projects/project_r4bd/data/_raw//cervical_cancer_risk_factors.zip, /home/people/s243903/projects/project_r4bd/data/_raw//risk_factors_cervical_cancer.csv
# Identify the CSV file (adjust if the file name differs)
csv_file <- file.path(raw_data_path, "risk_factors_cervical_cancer.csv")

# Check if the CSV file exists and read it
if (file.exists(csv_file)) {
  data_raw <- read.csv(file = csv_file)
  message("Data successfully loaded.")
} else {
  stop("CSV file not found after extraction. Check the ZIP contents.")
}
Data successfully loaded.
# Save the loaded data as 01_data_load.csv in the processed data folder
output_file <- file.path(processed_data_path, "01_data_load.csv")
write_csv(data_raw, output_file)

# Debugging: Confirm the file is saved
if (file.exists(output_file)) {
  message("Processed data saved at: ", output_file)
} else {
  stop("Failed to save the processed data.")
}
Processed data saved at: /home/people/s243903/projects/project_r4bd/data//01_data_load.csv
02_clean

02_clean

Loaing libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading data to clean

data_raw <- read.csv("../data/01_data_load.csv")

Cleaning

Correcting NAs

data_raw <- data_raw |>
  mutate(across(where(is.character), ~na_if(., "?")))

Changing binaries to ‘yes’ and ‘no’

data_raw <- data_raw |>
  mutate(
    Smokes = case_when(
      Smokes == "0.0" ~ "no",
      Smokes == "1.0" ~ "yes"
    ),
    Hormonal.Contraceptives = case_when(
      Hormonal.Contraceptives == "1.0" ~ "yes",
      Hormonal.Contraceptives == "0.0" ~ "no"
    ),
    IUD = case_when(
      IUD == "0.0" ~ "no",
      IUD == "1.0" ~ "yes"
    ),
    Dx.Cancer = case_when(
      Dx.Cancer == "0" ~ "no",
      Dx.Cancer == "1" ~ "yes"
    )
  )

Changing from character to integer

data_raw  <- data_raw |> 
  mutate(Number.of.sexual.partners = as.integer(Number.of.sexual.partners),
         First.sexual.intercourse = as.integer(First.sexual.intercourse),
         Num.of.pregnancies = as.integer(Num.of.pregnancies),
         Smokes..years. = as.numeric(Smokes..years.),
         Smokes..packs.year. = as.numeric(Smokes..packs.year.))

Updating names, removing dots

data_clean <- data_raw |> 
  rename('Smokes.years'= Smokes..years.,
         'Smokes.packs.years' = Smokes..packs.year.) |> 
  rename_with(~ gsub("^STDs\\.\\.", "", .)) |>
  rename_with(~ str_remove(.,"\\.$")) |> #removes the '.' from the last word in columns
  rename_with(~ str_replace_all(., "\\.", "_")) |> 
  rename_with(~ str_replace_all(.x, "__", "_"))

Exporting data

write_csv(x = data_clean,
          file = "../data/02_data_clean.csv")
03_augment

03_augment

Libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading data

data_clean <- read.csv("../data/02_data_clean.csv")

Data augmenting

Creating a Patient ID

data_clean <- data_clean |> 
  mutate(ID = row_number()) |> 
  select(ID, everything())

Creating long data

data_long <- data_clean |>
  pivot_longer(cols = starts_with("STDs_"), 
               names_to = "STD_type", 
               values_to = "has_STD") |>
  mutate(
    STD = ifelse(has_STD == 1, STD_type, NA)  # Keep STD name where 1 is present
  ) |>
  group_by(ID) |>
  mutate(
    # Concatenate STD names for each ID, if none, set "No"
    STD = ifelse(all(is.na(STD)), "No", paste(na.omit(STD), collapse = ", "))
  ) |>
  ungroup() |>
  select(-STD_type, -has_STD) |>
  distinct() |> 
  separate_rows(STD, sep = ",")

Remove STDs from the beginning of each row in column STD. Capitalize each word and remove the underscores.

data_long <- data_long |> 
  mutate(STD = str_remove(STD, "^STDs\\_"),
         STD = str_remove(STD, "^ STDs\\_"),
         STD = str_replace_all(STD, "\\_", " "),
         STD = str_to_title(STD))

Remove column STDs (We have the new STD column).

data_long <- data_long |> 
  select(-STDs, -number) 

Creating new number column that counts number of the same IDs.

data_long <- data_long |> 
  group_by(ID) |> 
  mutate(Number_of_STDs = if_else(all(is.na(STD) | STD == "No"), 0, n_distinct(STD, na.rm = TRUE))) |> 
  ungroup() 

Moving STD and Number of STD columns further ahead.

data_long <- data_long |> 
  select(
    1:12,               
    STD,                
    Number_of_STDs,      
    everything()        
  )

Creating new column for PCA analysis

data_aug <- data_long |>
  mutate(Cancer = if_else(Dx_Cancer == "yes", 1, 0))

Exporting data

write_csv(x = data_aug,
          file = "../data/03_data_aug.csv")
04_describe

04_describe

Loading libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading data

data_aug <- read_csv("../data/03_data_aug.csv")
Rows: 912 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): Smokes, Hormonal_Contraceptives, IUD, STD, Dx_Cancer
dbl (21): ID, Age, Number_of_sexual_partners, First_sexual_intercourse, Num_...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Descriptive statistics

Number of subjects

nr_of_subjects <- data_aug %>%
  summarise(distinct_ID_count = n_distinct(ID))

nr_of_subjects
# A tibble: 1 × 1
  distinct_ID_count
              <int>
1               858

Dimensions

Rows is the number of data points for STDs, so if a subject has 0 or 1 STD they will have 1 row. If a subject has multiple STDs, they will have one row per STD diagnosis.

data_aug |> 
  group_by(ID) |> 
  dim()
[1] 912  26

Age distribution

data_aug |> 
  group_by(ID) |> 
  ggplot(aes(x = Age)) + 
  geom_boxplot()

Number of Cancer positives

data_aug |> 
  group_by(Dx_Cancer) |> 
  count()
# A tibble: 2 × 2
# Groups:   Dx_Cancer [2]
  Dx_Cancer     n
  <chr>     <int>
1 no          894
2 yes          18

Smoking

Number of smokers

data_aug |> 
  group_by(Smokes) |> 
  count()
# A tibble: 3 × 2
# Groups:   Smokes [3]
  Smokes     n
  <chr>  <int>
1 no       759
2 yes      137
3 <NA>      16

Distribution of years of smoking

data_aug |> 
  group_by(ID) |> 
  ggplot(aes(x = Smokes_years)) + 
  geom_boxplot()
Warning: Removed 16 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Sexual health

First sexual intercourse distribution

data_aug |> 
  group_by(ID) |> 
  ggplot(aes(x = First_sexual_intercourse)) + 
  geom_boxplot()
Warning: Removed 7 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Number of pregnancies

data_aug |> 
  group_by(ID) |> 
  ggplot(aes(x = Num_of_pregnancies)) + 
  geom_boxplot()
Warning: Removed 59 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Number of sexual partners

data_aug |> 
  group_by(ID) |> 
  ggplot(aes(x = Number_of_sexual_partners)) + 
  geom_boxplot()
Warning: Removed 26 rows containing non-finite outside the scale range
(`stat_boxplot()`).

05_analysis_1

05_analysis_1

Loading libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(patchwork)

Loading data

data_aug <- read_csv("../data/03_data_aug.csv")
Rows: 912 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): Smokes, Hormonal_Contraceptives, IUD, STD, Dx_Cancer
dbl (21): ID, Age, Number_of_sexual_partners, First_sexual_intercourse, Num_...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Plotting

Numb_sex <- ggplot(data = data_aug, aes(y = Dx_Cancer, x = Number_of_sexual_partners, fill = Dx_Cancer)) +
  geom_boxplot(show.legend = FALSE) +
   labs(y = "Cancer Diagnosis",
        x = "Number of Sexual Partners") +
  theme_bw()


first_sex_int <- ggplot(data = data_aug, aes(y = Dx_Cancer, x = First_sexual_intercourse, fill = Dx_Cancer)) +
  geom_boxplot(show.legend = FALSE) +
   labs(y = "Cancer Diagnosis",
        x = "Age of First Sexual Intercourse") +
  theme_bw()


boxplot_sex_his <- (Numb_sex / first_sex_int) +
  plot_annotation(
    title = "Comparison of Sexual History and Cancer Diagnosis",
    caption = "Source: https://doi.org/10.24432/C5Z310")

boxplot_sex_his
Warning: Removed 26 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Warning: Removed 7 rows containing non-finite outside the scale range
(`stat_boxplot()`).

#save plot
ggsave("../results/images/05_boxplot_sex_his.png", plot = boxplot_sex_his)
Saving 7 x 5 in image
Warning: Removed 26 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Removed 7 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Normalized Counts of STD by Cancer Diagnosis

data_normalized_STD <- data_aug |>
  group_by(Dx_Cancer, STD) |>
  summarize(count = n(), .groups = "drop") |>
  group_by(Dx_Cancer) |>
  mutate(prop = count / sum(count))


data_normalized_STD <- data_normalized_STD |>
  mutate(STD = fct_reorder(STD, count, .desc = TRUE))

barplot_STD <- ggplot(data = data_normalized_STD, aes(x = STD, y = prop, fill = Dx_Cancer)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_grid(~Dx_Cancer) +
  labs(title = "Proportion of STD type by Cancer Diagnosis",
       x = "STD",
       y = "Proportion",
       fill = "Cancer Diagnosis",
       caption = "Source: https://doi.org/10.24432/C5Z310") +
  theme_bw() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),  # Rotate text
    plot.margin = margin(10, 10, 10, 50)  # Increase the left margin
  )

barplot_STD

#save plot
ggsave("../results/images/05_barplot_STD.png", plot = barplot_STD)
Saving 7 x 5 in image

Age and Cancer diagnosis

ggplot(data = data_aug, aes(y = Dx_Cancer, x = Age, fill = Dx_Cancer)) +
  geom_boxplot(show.legend = FALSE) +
  labs(y = "Cancer Diagnosis") +
  theme_bw()

Function for making a barplot

create_proportional_barplot_1 <- function(data, x_var, fill_var, 
                                        x_label = NULL, y_label = "Proportion", 
                                        fill_label = NULL, title = NULL) {
  # Dynamically use variable names if no custom labels are provided
  if (is.null(x_label)) x_label <- x_var
  if (is.null(fill_label)) fill_label <- fill_var
  if (is.null(title)) title <- paste("Proportion of", fill_var, "by", x_var)
  
  # Summarize the data by grouping by the ID and selecting the first value for each variable
  data_summary <- data %>%
    group_by(ID) %>%
    summarise(
      x_value = first(.data[[x_var]]),  # Taking the first value for x_var per ID
      fill_value = first(.data[[fill_var]])   # Taking the first value for fill_var per ID
    )
  
  # Remove rows with NA values in either x_value or fill_value
  filtered_data <- data_summary[!is.na(data_summary$x_value) & !is.na(data_summary$fill_value), ]
  
  # Convert the fill variable to a factor
  filtered_data$fill_value <- as.factor(filtered_data$fill_value)
  
  # Generate a color palette dynamically
  colors <- setNames(c("lightblue", "darkred"), levels(filtered_data$fill_value))
  
  # Generate the plot
  ggplot(filtered_data, aes(x = .data$x_value, fill = .data$fill_value)) +
    geom_bar(position = "fill") +
    labs(x = x_label, y = y_label, fill = fill_label, title = title) +
    scale_fill_manual(values = colors) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 0, hjust = 0.5))
}

Function for making a chi squared test

calculate_chi_squared_1 <- function(data, var1, var2) {
  # Summarize the data so that each ID has one row
  data_summary <- data %>%
    group_by(ID) %>%
    summarise(
      var1_value = first(.data[[var1]]),  # Taking the first value for var1 per ID
      var2_value = first(.data[[var2]])   # Taking the first value for var2 per ID
    )
  
  # Remove rows with NA values in either of the variables
  filtered_data <- data_summary[!is.na(data_summary$var1_value) & !is.na(data_summary$var2_value), ]
  
  # Create the contingency table
  contingency_table <- table(filtered_data$var1_value, filtered_data$var2_value)
  
  # Perform the chi-squared test
  chisq_result <- chisq.test(contingency_table)
  
  return(list(contingency_table = contingency_table, chisq_result = chisq_result))
}
PCA

PCA

Libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
library(cowplot)

Attaching package: 'cowplot'

The following object is masked from 'package:lubridate':

    stamp
library(ggrepel)

Loading

data_aug <- read.csv("../data/03_data_aug.csv")

Making a PCA plot based on patients have cancer or not. Created a new column based on Dx_cancer

cancer_data <- data_aug |> 
  select(-Time_since_first_diagnosis, -Time_since_last_diagnosis, Dx_CIN) |> drop_na() 

pca_fit_cancer <- cancer_data |> 
  select(where(is.numeric)) |>  
  prcomp(scale = TRUE)

scatterplot_pca <- pca_fit_cancer |> 
  augment(cancer_data) |> 
  ggplot(aes(.fittedPC1, .fittedPC2, color = factor(Cancer))) + 
  geom_point(size = 1.5) +
  theme_half_open(12) + background_grid() + scale_color_discrete(labels = c("0" = "Negative", "1" = "Positive")) +
  labs(color = "Cervical Cancer Present") + ggtitle("PCA Plot") + theme(plot.title = element_text(hjust = 0.5))

summary(pca_fit_cancer)
Importance of components:
                          PC1    PC2    PC3     PC4     PC5     PC6     PC7
Standard deviation     1.7460 1.5531 1.4707 1.32729 1.19567 1.11335 1.04722
Proportion of Variance 0.1605 0.1270 0.1138 0.09272 0.07524 0.06524 0.05772
Cumulative Proportion  0.1605 0.2874 0.4012 0.49396 0.56921 0.63445 0.69217
                           PC8     PC9    PC10   PC11    PC12    PC13   PC14
Standard deviation     1.02522 0.97290 0.93986 0.9028 0.83156 0.65611 0.5195
Proportion of Variance 0.05532 0.04982 0.04649 0.0429 0.03639 0.02266 0.0142
Cumulative Proportion  0.74749 0.79730 0.84380 0.8867 0.92309 0.94575 0.9599
                          PC15   PC16   PC17    PC18    PC19
Standard deviation     0.50325 0.4530 0.4088 0.28497 0.23277
Proportion of Variance 0.01333 0.0108 0.0088 0.00427 0.00285
Cumulative Proportion  0.97328 0.9841 0.9929 0.99715 1.00000
scatterplot_pca

#save plot
ggsave("../results/images/06_scatterplot_pca.png", plot = scatterplot_pca, bg = "white")
Saving 7 x 5 in image

The negative patients form a dense cluster in the right corner, whereas it is more spread out to the lower-left corner for the positive ones. Patients with positive cancer diagnosis have more distinct patterns in their values compared to those with no cancer. This separation suggests that features e.g., age, smoking or STD’s might contribute strongly to distinguish cancer-positive patients.

The data in PC coordinates

pca_fit_cancer |> 
  tidy(matrix = "rotation")
# A tibble: 361 × 3
   column    PC   value
   <chr>  <dbl>   <dbl>
 1 ID         1  0.0110
 2 ID         2 -0.0333
 3 ID         3 -0.222 
 4 ID         4  0.235 
 5 ID         5 -0.167 
 6 ID         6 -0.0169
 7 ID         7 -0.218 
 8 ID         8  0.353 
 9 ID         9 -0.213 
10 ID        10  0.653 
# ℹ 351 more rows
# define arrow style for plotting
arrow_style <- arrow(
  angle = 20, ends = "first", type = "closed", length = grid::unit(10, "pt")
)


pc_coordinates <- pca_fit_cancer |>
  tidy(matrix = "rotation") %>%
  pivot_wider(names_from = "PC", names_prefix = "PC", values_from = "value") %>%
  ggplot(aes(PC1, PC2)) +
  geom_segment(xend = 0, yend = 0, arrow = arrow_style) +
  geom_text_repel(
    aes(label = column),
    color = "darkblue"
  ) +
  #xlim(-0.5, .3) + ylim(-.25, 0.5) +
  coord_fixed() +
  theme_minimal_grid(12) +
  ggtitle("The data in the PC coordinates") + 
  theme(plot.title = element_text(hjust = 0.5))


#save plot
ggsave("../results/images/06_pc_coordinates.png", plot = pc_coordinates, bg = "white")
Saving 7 x 5 in image

The arrows shows how the features contribute to the principal components analysis.

The variance explained by each PC

A general rule is to keep the components that explains up to 95% of the variance.

pca_fit_cancer |>
  tidy(matrix = "eigenvalues")
# A tibble: 19 × 4
      PC std.dev percent cumulative
   <dbl>   <dbl>   <dbl>      <dbl>
 1     1   1.75  0.160        0.160
 2     2   1.55  0.127        0.287
 3     3   1.47  0.114        0.401
 4     4   1.33  0.0927       0.494
 5     5   1.20  0.0752       0.569
 6     6   1.11  0.0652       0.634
 7     7   1.05  0.0577       0.692
 8     8   1.03  0.0553       0.747
 9     9   0.973 0.0498       0.797
10    10   0.940 0.0465       0.844
11    11   0.903 0.0429       0.887
12    12   0.832 0.0364       0.923
13    13   0.656 0.0227       0.946
14    14   0.519 0.0142       0.960
15    15   0.503 0.0133       0.973
16    16   0.453 0.0108       0.984
17    17   0.409 0.0088       0.993
18    18   0.285 0.00427      0.997
19    19   0.233 0.00285      1    
pc_variance <- pca_fit_cancer |>
  tidy(matrix = "eigenvalues") |>
  ggplot(aes(PC, percent)) +
  geom_col(fill = "darkblue", alpha = 0.8) +
  scale_x_continuous(breaks = 1:17) +
  scale_y_continuous(
    labels = scales::percent_format(),
    expand = expansion(mult = c(0, 0.01))
  ) +
  theme_minimal_hgrid(12) +
  ggtitle("The variance explained by each PC") + theme(plot.title = element_text(hjust = 0.5))

#save plot
ggsave("../results/images/06_pc_variance.png", plot = pc_variance, bg = "white")
Saving 7 x 5 in image

After PC4 and PC% the variance explained by each PC drops significantly forming an ‘elbow’. After PC10 the other components explain very little variance. These components might not be useful for analyzing or modeling.

01_load

01_load

Loading packages

library(readr)
library(stringr)
library(here)
here() starts at /home/people/s243903/projects/project_r4bd

Loading data

# Base URL and paths
base_url <- "https://archive.ics.uci.edu/static/public/383/cervical+cancer+risk+factors.zip"
raw_data_path <- here("data/_raw/")
processed_data_path <- here("data/")
zip_file <- file.path(raw_data_path, "cervical_cancer_risk_factors.zip")

# Check and create the raw and processed data directories
dir.create(raw_data_path, recursive = TRUE, showWarnings = FALSE)
dir.create(processed_data_path, recursive = TRUE, showWarnings = FALSE)

# Debugging: Confirm directories
message("Raw data path: ", raw_data_path)
Raw data path: /home/people/s243903/projects/project_r4bd/data/_raw/
message("Processed data path: ", processed_data_path)
Processed data path: /home/people/s243903/projects/project_r4bd/data/
# Download ZIP file and extract
download.file(url = base_url, destfile = zip_file, mode = "wb")
if (!file.exists(zip_file)) stop("Failed to download ZIP file: ", zip_file)
unzip(zip_file, exdir = raw_data_path)

# Debugging: Check extracted contents
extracted_files <- list.files(raw_data_path, full.names = TRUE)
message("Extracted files: ", paste(extracted_files, collapse = ", "))
Extracted files: /home/people/s243903/projects/project_r4bd/data/_raw//cervical_cancer_risk_factors.zip, /home/people/s243903/projects/project_r4bd/data/_raw//risk_factors_cervical_cancer.csv
# Identify the CSV file (adjust if the file name differs)
csv_file <- file.path(raw_data_path, "risk_factors_cervical_cancer.csv")

# Check if the CSV file exists and read it
if (file.exists(csv_file)) {
  data_raw <- read.csv(file = csv_file)
  message("Data successfully loaded.")
} else {
  stop("CSV file not found after extraction. Check the ZIP contents.")
}
Data successfully loaded.
# Save the loaded data as 01_data_load.csv in the processed data folder
output_file <- file.path(processed_data_path, "01_data_load.csv")
write_csv(data_raw, output_file)

# Debugging: Confirm the file is saved
if (file.exists(output_file)) {
  message("Processed data saved at: ", output_file)
} else {
  stop("Failed to save the processed data.")
}
Processed data saved at: /home/people/s243903/projects/project_r4bd/data//01_data_load.csv
02_clean

02_clean

Loaing libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading data to clean

data_raw <- read.csv("../data/01_data_load.csv")

Cleaning

Correcting NAs

data_raw <- data_raw |>
  mutate(across(where(is.character), ~na_if(., "?")))

Changing binaries to ‘yes’ and ‘no’

data_raw <- data_raw |>
  mutate(
    Smokes = case_when(
      Smokes == "0.0" ~ "no",
      Smokes == "1.0" ~ "yes"
    ),
    Hormonal.Contraceptives = case_when(
      Hormonal.Contraceptives == "1.0" ~ "yes",
      Hormonal.Contraceptives == "0.0" ~ "no"
    ),
    IUD = case_when(
      IUD == "0.0" ~ "no",
      IUD == "1.0" ~ "yes"
    ),
    Dx.Cancer = case_when(
      Dx.Cancer == "0" ~ "no",
      Dx.Cancer == "1" ~ "yes"
    )
  )

Changing from character to integer

data_raw  <- data_raw |> 
  mutate(Number.of.sexual.partners = as.integer(Number.of.sexual.partners),
         First.sexual.intercourse = as.integer(First.sexual.intercourse),
         Num.of.pregnancies = as.integer(Num.of.pregnancies),
         Smokes..years. = as.numeric(Smokes..years.),
         Smokes..packs.year. = as.numeric(Smokes..packs.year.))

Updating names, removing dots

data_clean <- data_raw |> 
  rename('Smokes.years'= Smokes..years.,
         'Smokes.packs.years' = Smokes..packs.year.) |> 
  rename_with(~ gsub("^STDs\\.\\.", "", .)) |>
  rename_with(~ str_remove(.,"\\.$")) |> #removes the '.' from the last word in columns
  rename_with(~ str_replace_all(., "\\.", "_")) |> 
  rename_with(~ str_replace_all(.x, "__", "_"))

Exporting data

write_csv(x = data_clean,
          file = "../data/02_data_clean.csv")
03_augment

03_augment

Libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading data

data_clean <- read.csv("../data/02_data_clean.csv")

Data augmenting

Creating a Patient ID

data_clean <- data_clean |> 
  mutate(ID = row_number()) |> 
  select(ID, everything())

Creating long data

data_long <- data_clean |>
  pivot_longer(cols = starts_with("STDs_"), 
               names_to = "STD_type", 
               values_to = "has_STD") |>
  mutate(
    STD = ifelse(has_STD == 1, STD_type, NA)  # Keep STD name where 1 is present
  ) |>
  group_by(ID) |>
  mutate(
    # Concatenate STD names for each ID, if none, set "No"
    STD = ifelse(all(is.na(STD)), "No", paste(na.omit(STD), collapse = ", "))
  ) |>
  ungroup() |>
  select(-STD_type, -has_STD) |>
  distinct() |> 
  separate_rows(STD, sep = ",")

Remove STDs from the beginning of each row in column STD. Capitalize each word and remove the underscores.

data_long <- data_long |> 
  mutate(STD = str_remove(STD, "^STDs\\_"),
         STD = str_remove(STD, "^ STDs\\_"),
         STD = str_replace_all(STD, "\\_", " "),
         STD = str_to_title(STD))

Remove column STDs (We have the new STD column).

data_long <- data_long |> 
  select(-STDs, -number) 

Creating new number column that counts number of the same IDs.

data_long <- data_long |> 
  group_by(ID) |> 
  mutate(Number_of_STDs = if_else(all(is.na(STD) | STD == "No"), 0, n_distinct(STD, na.rm = TRUE))) |> 
  ungroup() 

Moving STD and Number of STD columns further ahead.

data_long <- data_long |> 
  select(
    1:12,               
    STD,                
    Number_of_STDs,      
    everything()        
  )

Creating new column for PCA analysis

data_aug <- data_long |>
  mutate(Cancer = if_else(Dx_Cancer == "yes", 1, 0))

Exporting data

write_csv(x = data_aug,
          file = "../data/03_data_aug.csv")
04_describe

04_describe

Loading libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading data

data_aug <- read_csv("../data/03_data_aug.csv")
Rows: 912 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): Smokes, Hormonal_Contraceptives, IUD, STD, Dx_Cancer
dbl (21): ID, Age, Number_of_sexual_partners, First_sexual_intercourse, Num_...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Descriptive statistics

Number of subjects

nr_of_subjects <- data_aug %>%
  summarise(distinct_ID_count = n_distinct(ID))

nr_of_subjects
# A tibble: 1 × 1
  distinct_ID_count
              <int>
1               858

Dimensions

Rows is the number of data points for STDs, so if a subject has 0 or 1 STD they will have 1 row. If a subject has multiple STDs, they will have one row per STD diagnosis.

data_aug |> 
  group_by(ID) |> 
  dim()
[1] 912  26

Age distribution

data_aug |> 
  group_by(ID) |> 
  ggplot(aes(x = Age)) + 
  geom_boxplot()

Number of Cancer positives

data_aug |> 
  group_by(Dx_Cancer) |> 
  count()
# A tibble: 2 × 2
# Groups:   Dx_Cancer [2]
  Dx_Cancer     n
  <chr>     <int>
1 no          894
2 yes          18

Smoking

Number of smokers

data_aug |> 
  group_by(Smokes) |> 
  count()
# A tibble: 3 × 2
# Groups:   Smokes [3]
  Smokes     n
  <chr>  <int>
1 no       759
2 yes      137
3 <NA>      16

Distribution of years of smoking

data_aug |> 
  group_by(ID) |> 
  ggplot(aes(x = Smokes_years)) + 
  geom_boxplot()
Warning: Removed 16 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Sexual health

First sexual intercourse distribution

data_aug |> 
  group_by(ID) |> 
  ggplot(aes(x = First_sexual_intercourse)) + 
  geom_boxplot()
Warning: Removed 7 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Number of pregnancies

data_aug |> 
  group_by(ID) |> 
  ggplot(aes(x = Num_of_pregnancies)) + 
  geom_boxplot()
Warning: Removed 59 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Number of sexual partners

data_aug |> 
  group_by(ID) |> 
  ggplot(aes(x = Number_of_sexual_partners)) + 
  geom_boxplot()
Warning: Removed 26 rows containing non-finite outside the scale range
(`stat_boxplot()`).

05_analysis_1

05_analysis_1

Loading libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(patchwork)

Loading data

data_aug <- read_csv("../data/03_data_aug.csv")
Rows: 912 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): Smokes, Hormonal_Contraceptives, IUD, STD, Dx_Cancer
dbl (21): ID, Age, Number_of_sexual_partners, First_sexual_intercourse, Num_...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Plotting

Numb_sex <- ggplot(data = data_aug, aes(y = Dx_Cancer, x = Number_of_sexual_partners, fill = Dx_Cancer)) +
  geom_boxplot(show.legend = FALSE) +
   labs(y = "Cancer Diagnosis",
        x = "Number of Sexual Partners") +
  theme_bw()


first_sex_int <- ggplot(data = data_aug, aes(y = Dx_Cancer, x = First_sexual_intercourse, fill = Dx_Cancer)) +
  geom_boxplot(show.legend = FALSE) +
   labs(y = "Cancer Diagnosis",
        x = "Age of First Sexual Intercourse") +
  theme_bw()


boxplot_sex_his <- (Numb_sex / first_sex_int) +
  plot_annotation(
    title = "Comparison of Sexual History and Cancer Diagnosis",
    caption = "Source: https://doi.org/10.24432/C5Z310")

boxplot_sex_his
Warning: Removed 26 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Warning: Removed 7 rows containing non-finite outside the scale range
(`stat_boxplot()`).

#save plot
ggsave("../results/images/05_boxplot_sex_his.png", plot = boxplot_sex_his)
Saving 7 x 5 in image
Warning: Removed 26 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Removed 7 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Normalized Counts of STD by Cancer Diagnosis

data_normalized_STD <- data_aug |>
  group_by(Dx_Cancer, STD) |>
  summarize(count = n(), .groups = "drop") |>
  group_by(Dx_Cancer) |>
  mutate(prop = count / sum(count))


data_normalized_STD <- data_normalized_STD |>
  mutate(STD = fct_reorder(STD, count, .desc = TRUE))

barplot_STD <- ggplot(data = data_normalized_STD, aes(x = STD, y = prop, fill = Dx_Cancer)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_grid(~Dx_Cancer) +
  labs(title = "Proportion of STD type by Cancer Diagnosis",
       x = "STD",
       y = "Proportion",
       fill = "Cancer Diagnosis",
       caption = "Source: https://doi.org/10.24432/C5Z310") +
  theme_bw() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),  # Rotate text
    plot.margin = margin(10, 10, 10, 50)  # Increase the left margin
  )

barplot_STD

#save plot
ggsave("../results/images/05_barplot_STD.png", plot = barplot_STD)
Saving 7 x 5 in image

Age and Cancer diagnosis

ggplot(data = data_aug, aes(y = Dx_Cancer, x = Age, fill = Dx_Cancer)) +
  geom_boxplot(show.legend = FALSE) +
  labs(y = "Cancer Diagnosis") +
  theme_bw()

Function for making a barplot

create_proportional_barplot_1 <- function(data, x_var, fill_var, 
                                        x_label = NULL, y_label = "Proportion", 
                                        fill_label = NULL, title = NULL) {
  # Dynamically use variable names if no custom labels are provided
  if (is.null(x_label)) x_label <- x_var
  if (is.null(fill_label)) fill_label <- fill_var
  if (is.null(title)) title <- paste("Proportion of", fill_var, "by", x_var)
  
  # Summarize the data by grouping by the ID and selecting the first value for each variable
  data_summary <- data %>%
    group_by(ID) %>%
    summarise(
      x_value = first(.data[[x_var]]),  # Taking the first value for x_var per ID
      fill_value = first(.data[[fill_var]])   # Taking the first value for fill_var per ID
    )
  
  # Remove rows with NA values in either x_value or fill_value
  filtered_data <- data_summary[!is.na(data_summary$x_value) & !is.na(data_summary$fill_value), ]
  
  # Convert the fill variable to a factor
  filtered_data$fill_value <- as.factor(filtered_data$fill_value)
  
  # Generate a color palette dynamically
  colors <- setNames(c("lightblue", "darkred"), levels(filtered_data$fill_value))
  
  # Generate the plot
  ggplot(filtered_data, aes(x = .data$x_value, fill = .data$fill_value)) +
    geom_bar(position = "fill") +
    labs(x = x_label, y = y_label, fill = fill_label, title = title) +
    scale_fill_manual(values = colors) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 0, hjust = 0.5))
}

Function for making a chi squared test

calculate_chi_squared_1 <- function(data, var1, var2) {
  # Summarize the data so that each ID has one row
  data_summary <- data %>%
    group_by(ID) %>%
    summarise(
      var1_value = first(.data[[var1]]),  # Taking the first value for var1 per ID
      var2_value = first(.data[[var2]])   # Taking the first value for var2 per ID
    )
  
  # Remove rows with NA values in either of the variables
  filtered_data <- data_summary[!is.na(data_summary$var1_value) & !is.na(data_summary$var2_value), ]
  
  # Create the contingency table
  contingency_table <- table(filtered_data$var1_value, filtered_data$var2_value)
  
  # Perform the chi-squared test
  chisq_result <- chisq.test(contingency_table)
  
  return(list(contingency_table = contingency_table, chisq_result = chisq_result))
}
PCA

PCA

Libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
library(cowplot)

Attaching package: 'cowplot'

The following object is masked from 'package:lubridate':

    stamp
library(ggrepel)

Loading

data_aug <- read.csv("../data/03_data_aug.csv")

Making a PCA plot based on patients have cancer or not. Created a new column based on Dx_cancer

cancer_data <- data_aug |> 
  select(-Time_since_first_diagnosis, -Time_since_last_diagnosis, Dx_CIN) |> drop_na() 

pca_fit_cancer <- cancer_data |> 
  select(where(is.numeric)) |>  
  prcomp(scale = TRUE)

scatterplot_pca <- pca_fit_cancer |> 
  augment(cancer_data) |> 
  ggplot(aes(.fittedPC1, .fittedPC2, color = factor(Cancer))) + 
  geom_point(size = 1.5) +
  theme_half_open(12) + background_grid() + scale_color_discrete(labels = c("0" = "Negative", "1" = "Positive")) +
  labs(color = "Cervical Cancer Present") + ggtitle("PCA Plot") + theme(plot.title = element_text(hjust = 0.5))

summary(pca_fit_cancer)
Importance of components:
                          PC1    PC2    PC3     PC4     PC5     PC6     PC7
Standard deviation     1.7460 1.5531 1.4707 1.32729 1.19567 1.11335 1.04722
Proportion of Variance 0.1605 0.1270 0.1138 0.09272 0.07524 0.06524 0.05772
Cumulative Proportion  0.1605 0.2874 0.4012 0.49396 0.56921 0.63445 0.69217
                           PC8     PC9    PC10   PC11    PC12    PC13   PC14
Standard deviation     1.02522 0.97290 0.93986 0.9028 0.83156 0.65611 0.5195
Proportion of Variance 0.05532 0.04982 0.04649 0.0429 0.03639 0.02266 0.0142
Cumulative Proportion  0.74749 0.79730 0.84380 0.8867 0.92309 0.94575 0.9599
                          PC15   PC16   PC17    PC18    PC19
Standard deviation     0.50325 0.4530 0.4088 0.28497 0.23277
Proportion of Variance 0.01333 0.0108 0.0088 0.00427 0.00285
Cumulative Proportion  0.97328 0.9841 0.9929 0.99715 1.00000
scatterplot_pca

#save plot
ggsave("../results/images/06_scatterplot_pca.png", plot = scatterplot_pca, bg = "white")
Saving 7 x 5 in image

The negative patients form a dense cluster in the right corner, whereas it is more spread out to the lower-left corner for the positive ones. Patients with positive cancer diagnosis have more distinct patterns in their values compared to those with no cancer. This separation suggests that features e.g., age, smoking or STD’s might contribute strongly to distinguish cancer-positive patients.

The data in PC coordinates

pca_fit_cancer |> 
  tidy(matrix = "rotation")
# A tibble: 361 × 3
   column    PC   value
   <chr>  <dbl>   <dbl>
 1 ID         1  0.0110
 2 ID         2 -0.0333
 3 ID         3 -0.222 
 4 ID         4  0.235 
 5 ID         5 -0.167 
 6 ID         6 -0.0169
 7 ID         7 -0.218 
 8 ID         8  0.353 
 9 ID         9 -0.213 
10 ID        10  0.653 
# ℹ 351 more rows
# define arrow style for plotting
arrow_style <- arrow(
  angle = 20, ends = "first", type = "closed", length = grid::unit(10, "pt")
)


pc_coordinates <- pca_fit_cancer |>
  tidy(matrix = "rotation") %>%
  pivot_wider(names_from = "PC", names_prefix = "PC", values_from = "value") %>%
  ggplot(aes(PC1, PC2)) +
  geom_segment(xend = 0, yend = 0, arrow = arrow_style) +
  geom_text_repel(
    aes(label = column),
    color = "darkblue"
  ) +
  #xlim(-0.5, .3) + ylim(-.25, 0.5) +
  coord_fixed() +
  theme_minimal_grid(12) +
  ggtitle("The data in the PC coordinates") + 
  theme(plot.title = element_text(hjust = 0.5))


#save plot
ggsave("../results/images/06_pc_coordinates.png", plot = pc_coordinates, bg = "white")
Saving 7 x 5 in image

The arrows shows how the features contribute to the principal components analysis.

The variance explained by each PC

A general rule is to keep the components that explains up to 95% of the variance.

pca_fit_cancer |>
  tidy(matrix = "eigenvalues")
# A tibble: 19 × 4
      PC std.dev percent cumulative
   <dbl>   <dbl>   <dbl>      <dbl>
 1     1   1.75  0.160        0.160
 2     2   1.55  0.127        0.287
 3     3   1.47  0.114        0.401
 4     4   1.33  0.0927       0.494
 5     5   1.20  0.0752       0.569
 6     6   1.11  0.0652       0.634
 7     7   1.05  0.0577       0.692
 8     8   1.03  0.0553       0.747
 9     9   0.973 0.0498       0.797
10    10   0.940 0.0465       0.844
11    11   0.903 0.0429       0.887
12    12   0.832 0.0364       0.923
13    13   0.656 0.0227       0.946
14    14   0.519 0.0142       0.960
15    15   0.503 0.0133       0.973
16    16   0.453 0.0108       0.984
17    17   0.409 0.0088       0.993
18    18   0.285 0.00427      0.997
19    19   0.233 0.00285      1    
pc_variance <- pca_fit_cancer |>
  tidy(matrix = "eigenvalues") |>
  ggplot(aes(PC, percent)) +
  geom_col(fill = "darkblue", alpha = 0.8) +
  scale_x_continuous(breaks = 1:17) +
  scale_y_continuous(
    labels = scales::percent_format(),
    expand = expansion(mult = c(0, 0.01))
  ) +
  theme_minimal_hgrid(12) +
  ggtitle("The variance explained by each PC") + theme(plot.title = element_text(hjust = 0.5))

#save plot
ggsave("../results/images/06_pc_variance.png", plot = pc_variance, bg = "white")
Saving 7 x 5 in image

After PC4 and PC% the variance explained by each PC drops significantly forming an ‘elbow’. After PC10 the other components explain very little variance. These components might not be useful for analyzing or modeling.

01_load

01_load

Loading packages

library(readr)
library(stringr)
library(here)
here() starts at /home/people/s243903/projects/project_r4bd

Loading data

# Base URL and paths
base_url <- "https://archive.ics.uci.edu/static/public/383/cervical+cancer+risk+factors.zip"
raw_data_path <- here("data/_raw/")
processed_data_path <- here("data/")
zip_file <- file.path(raw_data_path, "cervical_cancer_risk_factors.zip")

# Check and create the raw and processed data directories
dir.create(raw_data_path, recursive = TRUE, showWarnings = FALSE)
dir.create(processed_data_path, recursive = TRUE, showWarnings = FALSE)

# Debugging: Confirm directories
message("Raw data path: ", raw_data_path)
Raw data path: /home/people/s243903/projects/project_r4bd/data/_raw/
message("Processed data path: ", processed_data_path)
Processed data path: /home/people/s243903/projects/project_r4bd/data/
# Download ZIP file and extract
download.file(url = base_url, destfile = zip_file, mode = "wb")
if (!file.exists(zip_file)) stop("Failed to download ZIP file: ", zip_file)
unzip(zip_file, exdir = raw_data_path)

# Debugging: Check extracted contents
extracted_files <- list.files(raw_data_path, full.names = TRUE)
message("Extracted files: ", paste(extracted_files, collapse = ", "))
Extracted files: /home/people/s243903/projects/project_r4bd/data/_raw//cervical_cancer_risk_factors.zip, /home/people/s243903/projects/project_r4bd/data/_raw//risk_factors_cervical_cancer.csv
# Identify the CSV file (adjust if the file name differs)
csv_file <- file.path(raw_data_path, "risk_factors_cervical_cancer.csv")

# Check if the CSV file exists and read it
if (file.exists(csv_file)) {
  data_raw <- read.csv(file = csv_file)
  message("Data successfully loaded.")
} else {
  stop("CSV file not found after extraction. Check the ZIP contents.")
}
Data successfully loaded.
# Save the loaded data as 01_data_load.csv in the processed data folder
output_file <- file.path(processed_data_path, "01_data_load.csv")
write_csv(data_raw, output_file)

# Debugging: Confirm the file is saved
if (file.exists(output_file)) {
  message("Processed data saved at: ", output_file)
} else {
  stop("Failed to save the processed data.")
}
Processed data saved at: /home/people/s243903/projects/project_r4bd/data//01_data_load.csv
02_clean

02_clean

Loaing libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading data to clean

data_raw <- read.csv("../data/01_data_load.csv")

Cleaning

Correcting NAs

data_raw <- data_raw |>
  mutate(across(where(is.character), ~na_if(., "?")))

Changing binaries to ‘yes’ and ‘no’

data_raw <- data_raw |>
  mutate(
    Smokes = case_when(
      Smokes == "0.0" ~ "no",
      Smokes == "1.0" ~ "yes"
    ),
    Hormonal.Contraceptives = case_when(
      Hormonal.Contraceptives == "1.0" ~ "yes",
      Hormonal.Contraceptives == "0.0" ~ "no"
    ),
    IUD = case_when(
      IUD == "0.0" ~ "no",
      IUD == "1.0" ~ "yes"
    ),
    Dx.Cancer = case_when(
      Dx.Cancer == "0" ~ "no",
      Dx.Cancer == "1" ~ "yes"
    )
  )

Changing from character to integer

data_raw  <- data_raw |> 
  mutate(Number.of.sexual.partners = as.integer(Number.of.sexual.partners),
         First.sexual.intercourse = as.integer(First.sexual.intercourse),
         Num.of.pregnancies = as.integer(Num.of.pregnancies),
         Smokes..years. = as.numeric(Smokes..years.),
         Smokes..packs.year. = as.numeric(Smokes..packs.year.))

Updating names, removing dots

data_clean <- data_raw |> 
  rename('Smokes.years'= Smokes..years.,
         'Smokes.packs.years' = Smokes..packs.year.) |> 
  rename_with(~ gsub("^STDs\\.\\.", "", .)) |>
  rename_with(~ str_remove(.,"\\.$")) |> #removes the '.' from the last word in columns
  rename_with(~ str_replace_all(., "\\.", "_")) |> 
  rename_with(~ str_replace_all(.x, "__", "_"))

Exporting data

write_csv(x = data_clean,
          file = "../data/02_data_clean.csv")
03_augment

03_augment

Libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading data

data_clean <- read.csv("../data/02_data_clean.csv")

Data augmenting

Creating a Patient ID

data_clean <- data_clean |> 
  mutate(ID = row_number()) |> 
  select(ID, everything())

Creating long data

data_long <- data_clean |>
  pivot_longer(cols = starts_with("STDs_"), 
               names_to = "STD_type", 
               values_to = "has_STD") |>
  mutate(
    STD = ifelse(has_STD == 1, STD_type, NA)  # Keep STD name where 1 is present
  ) |>
  group_by(ID) |>
  mutate(
    # Concatenate STD names for each ID, if none, set "No"
    STD = ifelse(all(is.na(STD)), "No", paste(na.omit(STD), collapse = ", "))
  ) |>
  ungroup() |>
  select(-STD_type, -has_STD) |>
  distinct() |> 
  separate_rows(STD, sep = ",")

Remove STDs from the beginning of each row in column STD. Capitalize each word and remove the underscores.

data_long <- data_long |> 
  mutate(STD = str_remove(STD, "^STDs\\_"),
         STD = str_remove(STD, "^ STDs\\_"),
         STD = str_replace_all(STD, "\\_", " "),
         STD = str_to_title(STD))

Remove column STDs (We have the new STD column).

data_long <- data_long |> 
  select(-STDs, -number) 

Creating new number column that counts number of the same IDs.

data_long <- data_long |> 
  group_by(ID) |> 
  mutate(Number_of_STDs = if_else(all(is.na(STD) | STD == "No"), 0, n_distinct(STD, na.rm = TRUE))) |> 
  ungroup() 

Moving STD and Number of STD columns further ahead.

data_long <- data_long |> 
  select(
    1:12,               
    STD,                
    Number_of_STDs,      
    everything()        
  )

Creating new column for PCA analysis

data_aug <- data_long |>
  mutate(Cancer = if_else(Dx_Cancer == "yes", 1, 0))

Exporting data

write_csv(x = data_aug,
          file = "../data/03_data_aug.csv")
04_describe

04_describe

Loading libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading data

data_aug <- read_csv("../data/03_data_aug.csv")
Rows: 912 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): Smokes, Hormonal_Contraceptives, IUD, STD, Dx_Cancer
dbl (21): ID, Age, Number_of_sexual_partners, First_sexual_intercourse, Num_...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Descriptive statistics

Number of subjects

nr_of_subjects <- data_aug %>%
  summarise(distinct_ID_count = n_distinct(ID))

nr_of_subjects
# A tibble: 1 × 1
  distinct_ID_count
              <int>
1               858

Dimensions

Rows is the number of data points for STDs, so if a subject has 0 or 1 STD they will have 1 row. If a subject has multiple STDs, they will have one row per STD diagnosis.

data_aug |> 
  group_by(ID) |> 
  dim()
[1] 912  26

Age distribution

data_aug |> 
  group_by(ID) |> 
  ggplot(aes(x = Age)) + 
  geom_boxplot()

Number of Cancer positives

data_aug |> 
  group_by(Dx_Cancer) |> 
  count()
# A tibble: 2 × 2
# Groups:   Dx_Cancer [2]
  Dx_Cancer     n
  <chr>     <int>
1 no          894
2 yes          18

Smoking

Number of smokers

data_aug |> 
  group_by(Smokes) |> 
  count()
# A tibble: 3 × 2
# Groups:   Smokes [3]
  Smokes     n
  <chr>  <int>
1 no       759
2 yes      137
3 <NA>      16

Distribution of years of smoking

data_aug |> 
  group_by(ID) |> 
  ggplot(aes(x = Smokes_years)) + 
  geom_boxplot()
Warning: Removed 16 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Sexual health

First sexual intercourse distribution

data_aug |> 
  group_by(ID) |> 
  ggplot(aes(x = First_sexual_intercourse)) + 
  geom_boxplot()
Warning: Removed 7 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Number of pregnancies

data_aug |> 
  group_by(ID) |> 
  ggplot(aes(x = Num_of_pregnancies)) + 
  geom_boxplot()
Warning: Removed 59 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Number of sexual partners

data_aug |> 
  group_by(ID) |> 
  ggplot(aes(x = Number_of_sexual_partners)) + 
  geom_boxplot()
Warning: Removed 26 rows containing non-finite outside the scale range
(`stat_boxplot()`).

05_analysis_1

05_analysis_1

Loading libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(patchwork)

Loading data

data_aug <- read_csv("../data/03_data_aug.csv")
Rows: 912 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): Smokes, Hormonal_Contraceptives, IUD, STD, Dx_Cancer
dbl (21): ID, Age, Number_of_sexual_partners, First_sexual_intercourse, Num_...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Plotting

Numb_sex <- ggplot(data = data_aug, aes(y = Dx_Cancer, x = Number_of_sexual_partners, fill = Dx_Cancer)) +
  geom_boxplot(show.legend = FALSE) +
   labs(y = "Cancer Diagnosis",
        x = "Number of Sexual Partners") +
  theme_bw()


first_sex_int <- ggplot(data = data_aug, aes(y = Dx_Cancer, x = First_sexual_intercourse, fill = Dx_Cancer)) +
  geom_boxplot(show.legend = FALSE) +
   labs(y = "Cancer Diagnosis",
        x = "Age of First Sexual Intercourse") +
  theme_bw()


boxplot_sex_his <- (Numb_sex / first_sex_int) +
  plot_annotation(
    title = "Comparison of Sexual History and Cancer Diagnosis",
    caption = "Source: https://doi.org/10.24432/C5Z310")

boxplot_sex_his
Warning: Removed 26 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Warning: Removed 7 rows containing non-finite outside the scale range
(`stat_boxplot()`).

#save plot
ggsave("../results/images/05_boxplot_sex_his.png", plot = boxplot_sex_his)
Saving 7 x 5 in image
Warning: Removed 26 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Removed 7 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Normalized Counts of STD by Cancer Diagnosis

data_normalized_STD <- data_aug |>
  group_by(Dx_Cancer, STD) |>
  summarize(count = n(), .groups = "drop") |>
  group_by(Dx_Cancer) |>
  mutate(prop = count / sum(count))


data_normalized_STD <- data_normalized_STD |>
  mutate(STD = fct_reorder(STD, count, .desc = TRUE))

barplot_STD <- ggplot(data = data_normalized_STD, aes(x = STD, y = prop, fill = Dx_Cancer)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_grid(~Dx_Cancer) +
  labs(title = "Proportion of STD type by Cancer Diagnosis",
       x = "STD",
       y = "Proportion",
       fill = "Cancer Diagnosis",
       caption = "Source: https://doi.org/10.24432/C5Z310") +
  theme_bw() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),  # Rotate text
    plot.margin = margin(10, 10, 10, 50)  # Increase the left margin
  )

barplot_STD

#save plot
ggsave("../results/images/05_barplot_STD.png", plot = barplot_STD)
Saving 7 x 5 in image

Age and Cancer diagnosis

ggplot(data = data_aug, aes(y = Dx_Cancer, x = Age, fill = Dx_Cancer)) +
  geom_boxplot(show.legend = FALSE) +
  labs(y = "Cancer Diagnosis") +
  theme_bw()

Function for making a barplot

create_proportional_barplot_1 <- function(data, x_var, fill_var, 
                                        x_label = NULL, y_label = "Proportion", 
                                        fill_label = NULL, title = NULL) {
  # Dynamically use variable names if no custom labels are provided
  if (is.null(x_label)) x_label <- x_var
  if (is.null(fill_label)) fill_label <- fill_var
  if (is.null(title)) title <- paste("Proportion of", fill_var, "by", x_var)
  
  # Summarize the data by grouping by the ID and selecting the first value for each variable
  data_summary <- data %>%
    group_by(ID) %>%
    summarise(
      x_value = first(.data[[x_var]]),  # Taking the first value for x_var per ID
      fill_value = first(.data[[fill_var]])   # Taking the first value for fill_var per ID
    )
  
  # Remove rows with NA values in either x_value or fill_value
  filtered_data <- data_summary[!is.na(data_summary$x_value) & !is.na(data_summary$fill_value), ]
  
  # Convert the fill variable to a factor
  filtered_data$fill_value <- as.factor(filtered_data$fill_value)
  
  # Generate a color palette dynamically
  colors <- setNames(c("lightblue", "darkred"), levels(filtered_data$fill_value))
  
  # Generate the plot
  ggplot(filtered_data, aes(x = .data$x_value, fill = .data$fill_value)) +
    geom_bar(position = "fill") +
    labs(x = x_label, y = y_label, fill = fill_label, title = title) +
    scale_fill_manual(values = colors) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 0, hjust = 0.5))
}

Function for making a chi squared test

calculate_chi_squared_1 <- function(data, var1, var2) {
  # Summarize the data so that each ID has one row
  data_summary <- data %>%
    group_by(ID) %>%
    summarise(
      var1_value = first(.data[[var1]]),  # Taking the first value for var1 per ID
      var2_value = first(.data[[var2]])   # Taking the first value for var2 per ID
    )
  
  # Remove rows with NA values in either of the variables
  filtered_data <- data_summary[!is.na(data_summary$var1_value) & !is.na(data_summary$var2_value), ]
  
  # Create the contingency table
  contingency_table <- table(filtered_data$var1_value, filtered_data$var2_value)
  
  # Perform the chi-squared test
  chisq_result <- chisq.test(contingency_table)
  
  return(list(contingency_table = contingency_table, chisq_result = chisq_result))
}
PCA

PCA

Libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
library(cowplot)

Attaching package: 'cowplot'

The following object is masked from 'package:lubridate':

    stamp
library(ggrepel)

Loading

data_aug <- read.csv("../data/03_data_aug.csv")

Making a PCA plot based on patients have cancer or not. Created a new column based on Dx_cancer

cancer_data <- data_aug |> 
  select(-Time_since_first_diagnosis, -Time_since_last_diagnosis, Dx_CIN) |> drop_na() 

pca_fit_cancer <- cancer_data |> 
  select(where(is.numeric)) |>  
  prcomp(scale = TRUE)

scatterplot_pca <- pca_fit_cancer |> 
  augment(cancer_data) |> 
  ggplot(aes(.fittedPC1, .fittedPC2, color = factor(Cancer))) + 
  geom_point(size = 1.5) +
  theme_half_open(12) + background_grid() + scale_color_discrete(labels = c("0" = "Negative", "1" = "Positive")) +
  labs(color = "Cervical Cancer Present") + ggtitle("PCA Plot") + theme(plot.title = element_text(hjust = 0.5))

summary(pca_fit_cancer)
Importance of components:
                          PC1    PC2    PC3     PC4     PC5     PC6     PC7
Standard deviation     1.7460 1.5531 1.4707 1.32729 1.19567 1.11335 1.04722
Proportion of Variance 0.1605 0.1270 0.1138 0.09272 0.07524 0.06524 0.05772
Cumulative Proportion  0.1605 0.2874 0.4012 0.49396 0.56921 0.63445 0.69217
                           PC8     PC9    PC10   PC11    PC12    PC13   PC14
Standard deviation     1.02522 0.97290 0.93986 0.9028 0.83156 0.65611 0.5195
Proportion of Variance 0.05532 0.04982 0.04649 0.0429 0.03639 0.02266 0.0142
Cumulative Proportion  0.74749 0.79730 0.84380 0.8867 0.92309 0.94575 0.9599
                          PC15   PC16   PC17    PC18    PC19
Standard deviation     0.50325 0.4530 0.4088 0.28497 0.23277
Proportion of Variance 0.01333 0.0108 0.0088 0.00427 0.00285
Cumulative Proportion  0.97328 0.9841 0.9929 0.99715 1.00000
scatterplot_pca

#save plot
ggsave("../results/images/06_scatterplot_pca.png", plot = scatterplot_pca, bg = "white")
Saving 7 x 5 in image

The negative patients form a dense cluster in the right corner, whereas it is more spread out to the lower-left corner for the positive ones. Patients with positive cancer diagnosis have more distinct patterns in their values compared to those with no cancer. This separation suggests that features e.g., age, smoking or STD’s might contribute strongly to distinguish cancer-positive patients.

The data in PC coordinates

pca_fit_cancer |> 
  tidy(matrix = "rotation")
# A tibble: 361 × 3
   column    PC   value
   <chr>  <dbl>   <dbl>
 1 ID         1  0.0110
 2 ID         2 -0.0333
 3 ID         3 -0.222 
 4 ID         4  0.235 
 5 ID         5 -0.167 
 6 ID         6 -0.0169
 7 ID         7 -0.218 
 8 ID         8  0.353 
 9 ID         9 -0.213 
10 ID        10  0.653 
# ℹ 351 more rows
# define arrow style for plotting
arrow_style <- arrow(
  angle = 20, ends = "first", type = "closed", length = grid::unit(10, "pt")
)


pc_coordinates <- pca_fit_cancer |>
  tidy(matrix = "rotation") %>%
  pivot_wider(names_from = "PC", names_prefix = "PC", values_from = "value") %>%
  ggplot(aes(PC1, PC2)) +
  geom_segment(xend = 0, yend = 0, arrow = arrow_style) +
  geom_text_repel(
    aes(label = column),
    color = "darkblue"
  ) +
  #xlim(-0.5, .3) + ylim(-.25, 0.5) +
  coord_fixed() +
  theme_minimal_grid(12) +
  ggtitle("The data in the PC coordinates") + 
  theme(plot.title = element_text(hjust = 0.5))


#save plot
ggsave("../results/images/06_pc_coordinates.png", plot = pc_coordinates, bg = "white")
Saving 7 x 5 in image

The arrows shows how the features contribute to the principal components analysis.

The variance explained by each PC

A general rule is to keep the components that explains up to 95% of the variance.

pca_fit_cancer |>
  tidy(matrix = "eigenvalues")
# A tibble: 19 × 4
      PC std.dev percent cumulative
   <dbl>   <dbl>   <dbl>      <dbl>
 1     1   1.75  0.160        0.160
 2     2   1.55  0.127        0.287
 3     3   1.47  0.114        0.401
 4     4   1.33  0.0927       0.494
 5     5   1.20  0.0752       0.569
 6     6   1.11  0.0652       0.634
 7     7   1.05  0.0577       0.692
 8     8   1.03  0.0553       0.747
 9     9   0.973 0.0498       0.797
10    10   0.940 0.0465       0.844
11    11   0.903 0.0429       0.887
12    12   0.832 0.0364       0.923
13    13   0.656 0.0227       0.946
14    14   0.519 0.0142       0.960
15    15   0.503 0.0133       0.973
16    16   0.453 0.0108       0.984
17    17   0.409 0.0088       0.993
18    18   0.285 0.00427      0.997
19    19   0.233 0.00285      1    
pc_variance <- pca_fit_cancer |>
  tidy(matrix = "eigenvalues") |>
  ggplot(aes(PC, percent)) +
  geom_col(fill = "darkblue", alpha = 0.8) +
  scale_x_continuous(breaks = 1:17) +
  scale_y_continuous(
    labels = scales::percent_format(),
    expand = expansion(mult = c(0, 0.01))
  ) +
  theme_minimal_hgrid(12) +
  ggtitle("The variance explained by each PC") + theme(plot.title = element_text(hjust = 0.5))

#save plot
ggsave("../results/images/06_pc_variance.png", plot = pc_variance, bg = "white")
Saving 7 x 5 in image

After PC4 and PC% the variance explained by each PC drops significantly forming an ‘elbow’. After PC10 the other components explain very little variance. These components might not be useful for analyzing or modeling.

01_load

01_load

Loading packages

library(readr)
library(stringr)
library(here)
here() starts at /net/pupil1/home/people/s243634/projects/group_19_project

Loading data

# Base URL and paths
base_url <- "https://archive.ics.uci.edu/static/public/383/cervical+cancer+risk+factors.zip"
raw_data_path <- here("data/_raw/")
processed_data_path <- here("data/")
zip_file <- file.path(raw_data_path, "cervical_cancer_risk_factors.zip")

# Check and create the raw and processed data directories
dir.create(raw_data_path, recursive = TRUE, showWarnings = FALSE)
dir.create(processed_data_path, recursive = TRUE, showWarnings = FALSE)

# Debugging: Confirm directories
message("Raw data path: ", raw_data_path)
Raw data path: /net/pupil1/home/people/s243634/projects/group_19_project/data/_raw/
message("Processed data path: ", processed_data_path)
Processed data path: /net/pupil1/home/people/s243634/projects/group_19_project/data/
# Download ZIP file and extract
download.file(url = base_url, destfile = zip_file, mode = "wb")
if (!file.exists(zip_file)) stop("Failed to download ZIP file: ", zip_file)
unzip(zip_file, exdir = raw_data_path)

# Debugging: Check extracted contents
extracted_files <- list.files(raw_data_path, full.names = TRUE)
message("Extracted files: ", paste(extracted_files, collapse = ", "))
Extracted files: /net/pupil1/home/people/s243634/projects/group_19_project/data/_raw//cervical_cancer_risk_factors.zip, /net/pupil1/home/people/s243634/projects/group_19_project/data/_raw//risk_factors_cervical_cancer.csv
# Identify the CSV file (adjust if the file name differs)
csv_file <- file.path(raw_data_path, "risk_factors_cervical_cancer.csv")

# Check if the CSV file exists and read it
if (file.exists(csv_file)) {
  data_raw <- read.csv(file = csv_file)
  message("Data successfully loaded.")
} else {
  stop("CSV file not found after extraction. Check the ZIP contents.")
}
Data successfully loaded.
# Save the loaded data as 01_data_load.csv in the processed data folder
output_file <- file.path(processed_data_path, "01_data_load.csv")
write_csv(data_raw, output_file)

# Debugging: Confirm the file is saved
if (file.exists(output_file)) {
  message("Processed data saved at: ", output_file)
} else {
  stop("Failed to save the processed data.")
}
Processed data saved at: /net/pupil1/home/people/s243634/projects/group_19_project/data//01_data_load.csv
02_clean

02_clean

Loaing libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading data to clean

data_raw <- read.csv("../data/01_data_load.csv")

Cleaning

Correcting NAs

data_raw <- data_raw |>
  mutate(across(where(is.character), ~na_if(., "?")))

Changing binaries to ‘yes’ and ‘no’

data_raw <- data_raw |>
  mutate(
    Smokes = case_when(
      Smokes == "0.0" ~ "no",
      Smokes == "1.0" ~ "yes"
    ),
    Hormonal.Contraceptives = case_when(
      Hormonal.Contraceptives == "1.0" ~ "yes",
      Hormonal.Contraceptives == "0.0" ~ "no"
    ),
    IUD = case_when(
      IUD == "0.0" ~ "no",
      IUD == "1.0" ~ "yes"
    ),
    Dx.Cancer = case_when(
      Dx.Cancer == "0" ~ "no",
      Dx.Cancer == "1" ~ "yes"
    )
  )

Changing from character to integer

data_raw  <- data_raw |> 
  mutate(Number.of.sexual.partners = as.integer(Number.of.sexual.partners),
         First.sexual.intercourse = as.integer(First.sexual.intercourse),
         Num.of.pregnancies = as.integer(Num.of.pregnancies),
         Smokes..years. = as.numeric(Smokes..years.),
         Smokes..packs.year. = as.numeric(Smokes..packs.year.))

Updating names, removing dots

data_clean <- data_raw |> 
  rename('Smokes.years'= Smokes..years.,
         'Smokes.packs.years' = Smokes..packs.year.) |> 
  rename_with(~ gsub("^STDs\\.\\.", "", .)) |>
  rename_with(~ str_remove(.,"\\.$")) |> #removes the '.' from the last word in columns
  rename_with(~ str_replace_all(., "\\.", "_")) |> 
  rename_with(~ str_replace_all(.x, "__", "_"))

Exporting data

write_csv(x = data_clean,
          file = "../data/02_data_clean.csv")
03_augment

03_augment

Libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading data

data_clean <- read.csv("../data/02_data_clean.csv")

Data augmenting

Creating a Patient ID

data_clean <- data_clean |> 
  mutate(ID = row_number()) |> 
  select(ID, everything())

Creating long data

data_long <- data_clean |>
  pivot_longer(cols = starts_with("STDs_"), 
               names_to = "STD_type", 
               values_to = "has_STD") |>
  mutate(
    STD = ifelse(has_STD == 1, STD_type, NA)  # Keep STD name where 1 is present
  ) |>
  group_by(ID) |>
  mutate(
    # Concatenate STD names for each ID, if none, set "No"
    STD = ifelse(all(is.na(STD)), "No", paste(na.omit(STD), collapse = ", "))
  ) |>
  ungroup() |>
  select(-STD_type, -has_STD) |>
  distinct() |> 
  separate_rows(STD, sep = ",")

Remove STDs from the beginning of each row in column STD. Capitalize each word and remove the underscores.

data_long <- data_long |> 
  mutate(STD = str_remove(STD, "^STDs\\_"),
         STD = str_remove(STD, "^ STDs\\_"),
         STD = str_replace_all(STD, "\\_", " "),
         STD = str_to_title(STD))

Remove column STDs (We have the new STD column).

data_long <- data_long |> 
  select(-STDs, -number) 

Creating new number column that counts number of the same IDs.

data_long <- data_long |> 
  group_by(ID) |> 
  mutate(Number_of_STDs = if_else(all(is.na(STD) | STD == "No"), 0, n_distinct(STD, na.rm = TRUE))) |> 
  ungroup() 

Moving STD and Number of STD columns further ahead.

data_long <- data_long |> 
  select(
    1:12,               
    STD,                
    Number_of_STDs,      
    everything()        
  )

Creating new column for PCA analysis

data_aug <- data_long |>
  mutate(Cancer = if_else(Dx_Cancer == "yes", 1, 0))

Exporting data

write_csv(x = data_aug,
          file = "../data/03_data_aug.csv")
04_describe

04_describe

Loading libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading data

data_aug <- read_csv("../data/03_data_aug.csv")
Rows: 912 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): Smokes, Hormonal_Contraceptives, IUD, STD, Dx_Cancer
dbl (21): ID, Age, Number_of_sexual_partners, First_sexual_intercourse, Num_...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Descriptive statistics

Number of subjects

nr_of_subjects <- data_aug %>%
  summarise(distinct_ID_count = n_distinct(ID))

nr_of_subjects
# A tibble: 1 × 1
  distinct_ID_count
              <int>
1               858

Dimensions

Rows is the number of data points for STDs, so if a subject has 0 or 1 STD they will have 1 row. If a subject has multiple STDs, they will have one row per STD diagnosis.

data_aug |> 
  group_by(ID) |> 
  dim()
[1] 912  26

Age distribution

data_aug |> 
  group_by(ID) |> 
  ggplot(aes(x = Age)) + 
  geom_boxplot()

Number of Cancer positives

data_aug |> 
  group_by(Dx_Cancer) |> 
  count()
# A tibble: 2 × 2
# Groups:   Dx_Cancer [2]
  Dx_Cancer     n
  <chr>     <int>
1 no          894
2 yes          18

Smoking

Number of smokers

data_aug |> 
  group_by(Smokes) |> 
  count()
# A tibble: 3 × 2
# Groups:   Smokes [3]
  Smokes     n
  <chr>  <int>
1 no       759
2 yes      137
3 <NA>      16

Distribution of years of smoking

data_aug |> 
  group_by(ID) |> 
  ggplot(aes(x = Smokes_years)) + 
  geom_boxplot()
Warning: Removed 16 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Sexual health

First sexual intercourse distribution

data_aug |> 
  group_by(ID) |> 
  ggplot(aes(x = First_sexual_intercourse)) + 
  geom_boxplot()
Warning: Removed 7 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Number of pregnancies

data_aug |> 
  group_by(ID) |> 
  ggplot(aes(x = Num_of_pregnancies)) + 
  geom_boxplot()
Warning: Removed 59 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Number of sexual partners

data_aug |> 
  group_by(ID) |> 
  ggplot(aes(x = Number_of_sexual_partners)) + 
  geom_boxplot()
Warning: Removed 26 rows containing non-finite outside the scale range
(`stat_boxplot()`).

05_analysis_1

05_analysis_1

Loading libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(patchwork)

Loading data

data_aug <- read_csv("../data/03_data_aug.csv")
Rows: 912 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): Smokes, Hormonal_Contraceptives, IUD, STD, Dx_Cancer
dbl (21): ID, Age, Number_of_sexual_partners, First_sexual_intercourse, Num_...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Plotting

Numb_sex <- ggplot(data = data_aug, aes(y = Dx_Cancer, x = Number_of_sexual_partners, fill = Dx_Cancer)) +
  geom_boxplot(show.legend = FALSE) +
   labs(y = "Cancer Diagnosis",
        x = "Number of Sexual Partners") +
  theme_bw()


first_sex_int <- ggplot(data = data_aug, aes(y = Dx_Cancer, x = First_sexual_intercourse, fill = Dx_Cancer)) +
  geom_boxplot(show.legend = FALSE) +
   labs(y = "Cancer Diagnosis",
        x = "Age of First Sexual Intercourse") +
  theme_bw()


boxplot_sex_his <- (Numb_sex / first_sex_int) +
  plot_annotation(
    title = "Comparison of Sexual History and Cancer Diagnosis")

boxplot_sex_his
Warning: Removed 26 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Warning: Removed 7 rows containing non-finite outside the scale range
(`stat_boxplot()`).

#save plot
ggsave("../results/images/05_boxplot_sex_his.png", plot = boxplot_sex_his)
Saving 7 x 5 in image
Warning: Removed 26 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Removed 7 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Normalized Counts of STD by Cancer Diagnosis

data_normalized_STD <- data_aug |>
  group_by(Dx_Cancer, STD) |>
  summarize(count = n(), .groups = "drop") |>
  group_by(Dx_Cancer) |>
  mutate(prop = count / sum(count))


data_normalized_STD <- data_normalized_STD |>
  mutate(STD = fct_reorder(STD, count, .desc = TRUE))

barplot_STD <- ggplot(data = data_normalized_STD, aes(x = STD, y = prop, fill = Dx_Cancer)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_grid(~Dx_Cancer) +
  labs(title = "Proportion of STD type by Cancer Diagnosis",
       x = "STD",
       y = "Proportion",
       fill = "Cancer Diagnosis") +
  theme_bw() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),  # Rotate text
    plot.margin = margin(10, 10, 10, 50)  # Increase the left margin
  )

barplot_STD

#save plot
ggsave("../results/images/05_barplot_STD.png", plot = barplot_STD)
Saving 7 x 5 in image

Age and Cancer diagnosis

ggplot(data = data_aug, aes(y = Dx_Cancer, x = Age, fill = Dx_Cancer)) +
  geom_boxplot(show.legend = FALSE) +
  labs(y = "Cancer Diagnosis") +
  theme_bw()

Function for making a barplot

create_proportional_barplot <- function(data, x_var, fill_var, 
                                        x_label = NULL, y_label = "Proportion", 
                                        fill_label = NULL, title = NULL) {
  if (is.null(x_label)) x_label <- x_var
  if (is.null(fill_label)) fill_label <- fill_var
  if (is.null(title)) title <- paste("Proportion of", fill_var, "by", x_var)

  filtered_data <- data %>%
    group_by(ID) %>%
    summarise(
      x_value = first(!!sym(x_var)),
      fill_value = first(!!sym(fill_var)),
      .groups = "drop"
    ) %>%
    filter(!is.na(x_value), !is.na(fill_value)) %>%
    mutate(fill_value = factor(fill_value))

  # Ensure color palette matches the levels of 'fill_value'
  color_palette <- setNames(
    c("lightblue", "darkred", "green", "orange")[1:length(levels(filtered_data$fill_value))],
    levels(filtered_data$fill_value)
  )

  ggplot(filtered_data, aes(x = x_value, fill = fill_value)) +
    geom_bar(position = "fill") +
    labs(x = x_label, y = y_label, fill = fill_label, title = title) +
    scale_fill_manual(values = color_palette) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 0, hjust = 0.5))
}

proportion_plot <- create_proportional_barplot(data_aug, x_var = "Dx_Cancer", fill_var = "Dx_HPV")
ggsave("../results/images/05_proportion_plot.png", plot = proportion_plot, bg = "white")
Saving 7 x 5 in image

Function for making a chi squared test

calculate_chi_squared <- function(data, var1, var2) {
  # Summarize the data so each ID has one row, taking the first occurrence of var1 and var2
  filtered_data <- data %>%
    group_by(ID) %>%
    summarise(
      var1_value = first(!!sym(var1)),
      var2_value = first(!!sym(var2)),
      .groups = "drop"
    ) %>%
    filter(!is.na(var1_value), !is.na(var2_value))  # Remove rows with NA values
  
  # Create the contingency table
  contingency_table <- filtered_data %>%
    count(var1_value, var2_value) %>%
    pivot_wider(names_from = var2_value, values_from = n, values_fill = 0) %>%
    column_to_rownames("var1_value") %>%
    as.matrix()

  # Perform the chi-squared test
  chisq_result <- chisq.test(contingency_table)
  
  return(list(contingency_table = contingency_table, chisq_result = chisq_result))
}

calculate_chi_squared(data_aug, var1 = "Dx_Cancer", var2 = "Dx_HPV")
Warning in chisq.test(contingency_table): Chi-squared approximation may be
incorrect
$contingency_table
      0  1
no  838  2
yes   2 16

$chisq_result

    Pearson's Chi-squared test with Yates' continuity correction

data:  contingency_table
X-squared = 631.83, df = 1, p-value < 2.2e-16
PCA

PCA

Libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
library(cowplot)

Attaching package: 'cowplot'

The following object is masked from 'package:lubridate':

    stamp
library(ggrepel)

Loading

data_aug <- read.csv("../data/03_data_aug.csv")

Making a PCA plot based on patients have cancer or not. Created a new column based on Dx_cancer

cancer_data <- data_aug |> 
  select(-Time_since_first_diagnosis, -Time_since_last_diagnosis, Dx_CIN) |> drop_na() 

pca_fit_cancer <- cancer_data |> 
  select(where(is.numeric)) |>  
  prcomp(scale = TRUE)

scatterplot_pca <- pca_fit_cancer |> 
  augment(cancer_data) |> 
  ggplot(aes(.fittedPC1, .fittedPC2, color = factor(Cancer))) + 
  geom_point(size = 1.5) +
  theme_half_open(12) + background_grid() + scale_color_discrete(labels = c("0" = "Negative", "1" = "Positive")) +
  labs(color = "Cervical Cancer Present") + ggtitle("PCA Plot") + theme(plot.title = element_text(hjust = 0.5))

summary(pca_fit_cancer)
Importance of components:
                          PC1    PC2    PC3     PC4     PC5     PC6     PC7
Standard deviation     1.7460 1.5531 1.4707 1.32729 1.19567 1.11335 1.04722
Proportion of Variance 0.1605 0.1270 0.1138 0.09272 0.07524 0.06524 0.05772
Cumulative Proportion  0.1605 0.2874 0.4012 0.49396 0.56921 0.63445 0.69217
                           PC8     PC9    PC10   PC11    PC12    PC13   PC14
Standard deviation     1.02522 0.97290 0.93986 0.9028 0.83156 0.65611 0.5195
Proportion of Variance 0.05532 0.04982 0.04649 0.0429 0.03639 0.02266 0.0142
Cumulative Proportion  0.74749 0.79730 0.84380 0.8867 0.92309 0.94575 0.9599
                          PC15   PC16   PC17    PC18    PC19
Standard deviation     0.50325 0.4530 0.4088 0.28497 0.23277
Proportion of Variance 0.01333 0.0108 0.0088 0.00427 0.00285
Cumulative Proportion  0.97328 0.9841 0.9929 0.99715 1.00000
scatterplot_pca

#save plot
ggsave("../results/images/06_scatterplot_pca.png", plot = scatterplot_pca, bg = "white")
Saving 7 x 5 in image

The negative patients form a dense cluster in the right corner, whereas it is more spread out to the lower-left corner for the positive ones. Patients with positive cancer diagnosis have more distinct patterns in their values compared to those with no cancer. This separation suggests that features e.g., age, smoking or STD’s might contribute strongly to distinguish cancer-positive patients.

The data in PC coordinates

pca_fit_cancer |> 
  tidy(matrix = "rotation")
# A tibble: 361 × 3
   column    PC   value
   <chr>  <dbl>   <dbl>
 1 ID         1  0.0110
 2 ID         2 -0.0333
 3 ID         3 -0.222 
 4 ID         4  0.235 
 5 ID         5 -0.167 
 6 ID         6 -0.0169
 7 ID         7 -0.218 
 8 ID         8  0.353 
 9 ID         9 -0.213 
10 ID        10  0.653 
# ℹ 351 more rows
# define arrow style for plotting
arrow_style <- arrow(
  angle = 20, ends = "first", type = "closed", length = grid::unit(10, "pt")
)


pc_coordinates <- pca_fit_cancer |>
  tidy(matrix = "rotation") %>%
  pivot_wider(names_from = "PC", names_prefix = "PC", values_from = "value") %>%
  ggplot(aes(PC1, PC2)) +
  geom_segment(xend = 0, yend = 0, arrow = arrow_style) +
  geom_text_repel(
    aes(label = column),
    color = "darkblue"
  ) +
  #xlim(-0.5, .3) + ylim(-.25, 0.5) +
  coord_fixed() +
  theme_minimal_grid(12) +
  ggtitle("The data in the PC coordinates") + 
  theme(plot.title = element_text(hjust = 0.5))


#save plot
ggsave("../results/images/06_pc_coordinates.png", plot = pc_coordinates, bg = "white")
Saving 7 x 5 in image

The arrows shows how the features contribute to the principal components analysis.

The variance explained by each PC

A general rule is to keep the components that explains up to 95% of the variance.

pca_fit_cancer |>
  tidy(matrix = "eigenvalues")
# A tibble: 19 × 4
      PC std.dev percent cumulative
   <dbl>   <dbl>   <dbl>      <dbl>
 1     1   1.75  0.160        0.160
 2     2   1.55  0.127        0.287
 3     3   1.47  0.114        0.401
 4     4   1.33  0.0927       0.494
 5     5   1.20  0.0752       0.569
 6     6   1.11  0.0652       0.634
 7     7   1.05  0.0577       0.692
 8     8   1.03  0.0553       0.747
 9     9   0.973 0.0498       0.797
10    10   0.940 0.0465       0.844
11    11   0.903 0.0429       0.887
12    12   0.832 0.0364       0.923
13    13   0.656 0.0227       0.946
14    14   0.519 0.0142       0.960
15    15   0.503 0.0133       0.973
16    16   0.453 0.0108       0.984
17    17   0.409 0.0088       0.993
18    18   0.285 0.00427      0.997
19    19   0.233 0.00285      1    
pc_variance <- pca_fit_cancer |>
  tidy(matrix = "eigenvalues") |>
  ggplot(aes(PC, percent)) +
  geom_col(fill = "darkblue", alpha = 0.8) +
  scale_x_continuous(breaks = 1:17) +
  scale_y_continuous(
    labels = scales::percent_format(),
    expand = expansion(mult = c(0, 0.01))
  ) +
  theme_minimal_hgrid(12) +
  ggtitle("The variance explained by each PC") + theme(plot.title = element_text(hjust = 0.5))

#save plot
ggsave("../results/images/06_pc_variance.png", plot = pc_variance, bg = "white")
Saving 7 x 5 in image

After PC4 and PC% the variance explained by each PC drops significantly forming an ‘elbow’. After PC10 the other components explain very little variance. These components might not be useful for analyzing or modeling.

01_load

01_load

Loading packages

library(readr)
library(stringr)
library(here)
here() starts at /net/pupil1/home/people/s243634/projects/group_19_project

Loading data

# Base URL and paths
base_url <- "https://archive.ics.uci.edu/static/public/383/cervical+cancer+risk+factors.zip"
raw_data_path <- here("data/_raw/")
processed_data_path <- here("data/")
zip_file <- file.path(raw_data_path, "cervical_cancer_risk_factors.zip")

# Check and create the raw and processed data directories
dir.create(raw_data_path, recursive = TRUE, showWarnings = FALSE)
dir.create(processed_data_path, recursive = TRUE, showWarnings = FALSE)

# Debugging: Confirm directories
message("Raw data path: ", raw_data_path)
Raw data path: /net/pupil1/home/people/s243634/projects/group_19_project/data/_raw/
message("Processed data path: ", processed_data_path)
Processed data path: /net/pupil1/home/people/s243634/projects/group_19_project/data/
# Download ZIP file and extract
download.file(url = base_url, destfile = zip_file, mode = "wb")
if (!file.exists(zip_file)) stop("Failed to download ZIP file: ", zip_file)
unzip(zip_file, exdir = raw_data_path)

# Debugging: Check extracted contents
extracted_files <- list.files(raw_data_path, full.names = TRUE)
message("Extracted files: ", paste(extracted_files, collapse = ", "))
Extracted files: /net/pupil1/home/people/s243634/projects/group_19_project/data/_raw//cervical_cancer_risk_factors.zip, /net/pupil1/home/people/s243634/projects/group_19_project/data/_raw//risk_factors_cervical_cancer.csv
# Identify the CSV file (adjust if the file name differs)
csv_file <- file.path(raw_data_path, "risk_factors_cervical_cancer.csv")

# Check if the CSV file exists and read it
if (file.exists(csv_file)) {
  data_raw <- read.csv(file = csv_file)
  message("Data successfully loaded.")
} else {
  stop("CSV file not found after extraction. Check the ZIP contents.")
}
Data successfully loaded.
# Save the loaded data as 01_data_load.csv in the processed data folder
output_file <- file.path(processed_data_path, "01_data_load.csv")
write_csv(data_raw, output_file)

# Debugging: Confirm the file is saved
if (file.exists(output_file)) {
  message("Processed data saved at: ", output_file)
} else {
  stop("Failed to save the processed data.")
}
Processed data saved at: /net/pupil1/home/people/s243634/projects/group_19_project/data//01_data_load.csv
02_clean

02_clean

Loaing libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading data to clean

data_raw <- read.csv("../data/01_data_load.csv")

Cleaning

Correcting NAs

data_raw <- data_raw |>
  mutate(across(where(is.character), ~na_if(., "?")))

Changing binaries to ‘yes’ and ‘no’

data_raw <- data_raw |>
  mutate(
    Smokes = case_when(
      Smokes == "0.0" ~ "no",
      Smokes == "1.0" ~ "yes"
    ),
    Hormonal.Contraceptives = case_when(
      Hormonal.Contraceptives == "1.0" ~ "yes",
      Hormonal.Contraceptives == "0.0" ~ "no"
    ),
    IUD = case_when(
      IUD == "0.0" ~ "no",
      IUD == "1.0" ~ "yes"
    ),
    Dx.Cancer = case_when(
      Dx.Cancer == "0" ~ "no",
      Dx.Cancer == "1" ~ "yes"
    )
  )

Changing from character to integer

data_raw  <- data_raw |> 
  mutate(Number.of.sexual.partners = as.integer(Number.of.sexual.partners),
         First.sexual.intercourse = as.integer(First.sexual.intercourse),
         Num.of.pregnancies = as.integer(Num.of.pregnancies),
         Smokes..years. = as.numeric(Smokes..years.),
         Smokes..packs.year. = as.numeric(Smokes..packs.year.))

Updating names, removing dots

data_clean <- data_raw |> 
  rename('Smokes.years'= Smokes..years.,
         'Smokes.packs.years' = Smokes..packs.year.) |> 
  rename_with(~ gsub("^STDs\\.\\.", "", .)) |>
  rename_with(~ str_remove(.,"\\.$")) |> #removes the '.' from the last word in columns
  rename_with(~ str_replace_all(., "\\.", "_")) |> 
  rename_with(~ str_replace_all(.x, "__", "_"))

Exporting data

write_csv(x = data_clean,
          file = "../data/02_data_clean.csv")
03_augment

03_augment

Libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading data

data_clean <- read.csv("../data/02_data_clean.csv")

Data augmenting

Creating a Patient ID

data_clean <- data_clean |> 
  mutate(ID = row_number()) |> 
  select(ID, everything())

Creating long data

data_long <- data_clean |>
  pivot_longer(cols = starts_with("STDs_"), 
               names_to = "STD_type", 
               values_to = "has_STD") |>
  mutate(
    STD = ifelse(has_STD == 1, STD_type, NA)  # Keep STD name where 1 is present
  ) |>
  group_by(ID) |>
  mutate(
    # Concatenate STD names for each ID, if none, set "No"
    STD = ifelse(all(is.na(STD)), "No", paste(na.omit(STD), collapse = ", "))
  ) |>
  ungroup() |>
  select(-STD_type, -has_STD) |>
  distinct() |> 
  separate_rows(STD, sep = ",")

Remove STDs from the beginning of each row in column STD. Capitalize each word and remove the underscores.

data_long <- data_long |> 
  mutate(STD = str_remove(STD, "^STDs\\_"),
         STD = str_remove(STD, "^ STDs\\_"),
         STD = str_replace_all(STD, "\\_", " "),
         STD = str_to_title(STD))

Remove column STDs (We have the new STD column).

data_long <- data_long |> 
  select(-STDs, -number) 

Creating new number column that counts number of the same IDs.

data_long <- data_long |> 
  group_by(ID) |> 
  mutate(Number_of_STDs = if_else(all(is.na(STD) | STD == "No"), 0, n_distinct(STD, na.rm = TRUE))) |> 
  ungroup() 

Moving STD and Number of STD columns further ahead.

data_long <- data_long |> 
  select(
    1:12,               
    STD,                
    Number_of_STDs,      
    everything()        
  )

Creating new column for PCA analysis

data_aug <- data_long |>
  mutate(Cancer = if_else(Dx_Cancer == "yes", 1, 0))

Exporting data

write_csv(x = data_aug,
          file = "../data/03_data_aug.csv")
04_describe

04_describe

Loading libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading data

data_aug <- read_csv("../data/03_data_aug.csv")
Rows: 912 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): Smokes, Hormonal_Contraceptives, IUD, STD, Dx_Cancer
dbl (21): ID, Age, Number_of_sexual_partners, First_sexual_intercourse, Num_...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Descriptive statistics

Number of subjects

nr_of_subjects <- data_aug %>%
  summarise(distinct_ID_count = n_distinct(ID))

nr_of_subjects
# A tibble: 1 × 1
  distinct_ID_count
              <int>
1               858

Dimensions

Rows is the number of data points for STDs, so if a subject has 0 or 1 STD they will have 1 row. If a subject has multiple STDs, they will have one row per STD diagnosis.

data_aug |> 
  group_by(ID) |> 
  dim()
[1] 912  26

Age distribution

data_aug |> 
  group_by(ID) |> 
  ggplot(aes(x = Age)) + 
  geom_boxplot()

Number of Cancer positives

data_aug |> 
  group_by(Dx_Cancer) |> 
  count()
# A tibble: 2 × 2
# Groups:   Dx_Cancer [2]
  Dx_Cancer     n
  <chr>     <int>
1 no          894
2 yes          18

Smoking

Number of smokers

data_aug |> 
  group_by(Smokes) |> 
  count()
# A tibble: 3 × 2
# Groups:   Smokes [3]
  Smokes     n
  <chr>  <int>
1 no       759
2 yes      137
3 <NA>      16

Distribution of years of smoking

data_aug |> 
  group_by(ID) |> 
  ggplot(aes(x = Smokes_years)) + 
  geom_boxplot()
Warning: Removed 16 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Sexual health

First sexual intercourse distribution

data_aug |> 
  group_by(ID) |> 
  ggplot(aes(x = First_sexual_intercourse)) + 
  geom_boxplot()
Warning: Removed 7 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Number of pregnancies

data_aug |> 
  group_by(ID) |> 
  ggplot(aes(x = Num_of_pregnancies)) + 
  geom_boxplot()
Warning: Removed 59 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Number of sexual partners

data_aug |> 
  group_by(ID) |> 
  ggplot(aes(x = Number_of_sexual_partners)) + 
  geom_boxplot()
Warning: Removed 26 rows containing non-finite outside the scale range
(`stat_boxplot()`).

05_analysis_1

05_analysis_1

Loading libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(patchwork)

Loading data

data_aug <- read_csv("../data/03_data_aug.csv")
Rows: 912 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): Smokes, Hormonal_Contraceptives, IUD, STD, Dx_Cancer
dbl (21): ID, Age, Number_of_sexual_partners, First_sexual_intercourse, Num_...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Plotting

Numb_sex <- ggplot(data = data_aug, aes(y = Dx_Cancer, x = Number_of_sexual_partners, fill = Dx_Cancer)) +
  geom_boxplot(show.legend = FALSE) +
   labs(y = "Cancer Diagnosis",
        x = "Number of Sexual Partners") +
  theme_bw()


first_sex_int <- ggplot(data = data_aug, aes(y = Dx_Cancer, x = First_sexual_intercourse, fill = Dx_Cancer)) +
  geom_boxplot(show.legend = FALSE) +
   labs(y = "Cancer Diagnosis",
        x = "Age of First Sexual Intercourse") +
  theme_bw()


boxplot_sex_his <- (Numb_sex / first_sex_int) +
  plot_annotation(
    title = "Comparison of Sexual History and Cancer Diagnosis")

boxplot_sex_his
Warning: Removed 26 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Warning: Removed 7 rows containing non-finite outside the scale range
(`stat_boxplot()`).

#save plot
ggsave("../results/images/05_boxplot_sex_his.png", plot = boxplot_sex_his)
Saving 7 x 5 in image
Warning: Removed 26 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Removed 7 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Normalized Counts of STD by Cancer Diagnosis

data_normalized_STD <- data_aug |>
  group_by(Dx_Cancer, STD) |>
  summarize(count = n(), .groups = "drop") |>
  group_by(Dx_Cancer) |>
  mutate(prop = count / sum(count))


data_normalized_STD <- data_normalized_STD |>
  mutate(STD = fct_reorder(STD, count, .desc = TRUE))

barplot_STD <- ggplot(data = data_normalized_STD, aes(x = STD, y = prop, fill = Dx_Cancer)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_grid(~Dx_Cancer) +
  labs(title = "Proportion of STD type by Cancer Diagnosis",
       x = "STD",
       y = "Proportion",
       fill = "Cancer Diagnosis") +
  theme_bw() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),  # Rotate text
    plot.margin = margin(10, 10, 10, 50)  # Increase the left margin
  )

barplot_STD

#save plot
ggsave("../results/images/05_barplot_STD.png", plot = barplot_STD)
Saving 7 x 5 in image

Age and Cancer diagnosis

ggplot(data = data_aug, aes(y = Dx_Cancer, x = Age, fill = Dx_Cancer)) +
  geom_boxplot(show.legend = FALSE) +
  labs(y = "Cancer Diagnosis") +
  theme_bw()

Function for making a barplot

create_proportional_barplot <- function(data, x_var, fill_var, 
                                        x_label = NULL, y_label = "Proportion", 
                                        fill_label = NULL, title = NULL) {
  if (is.null(x_label)) x_label <- x_var
  if (is.null(fill_label)) fill_label <- fill_var
  if (is.null(title)) title <- paste("Proportion of", fill_var, "by", x_var)

  filtered_data <- data %>%
    group_by(ID) %>%
    summarise(
      x_value = first(!!sym(x_var)),
      fill_value = first(!!sym(fill_var)),
      .groups = "drop"
    ) %>%
    filter(!is.na(x_value), !is.na(fill_value)) %>%
    mutate(fill_value = factor(fill_value))

  # Ensure color palette matches the levels of 'fill_value'
  color_palette <- setNames(
    c("lightblue", "darkred", "green", "orange")[1:length(levels(filtered_data$fill_value))],
    levels(filtered_data$fill_value)
  )

  ggplot(filtered_data, aes(x = x_value, fill = fill_value)) +
    geom_bar(position = "fill") +
    labs(x = x_label, y = y_label, fill = fill_label, title = title) +
    scale_fill_manual(values = color_palette) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 0, hjust = 0.5))
}

proportion_plot <- create_proportional_barplot(data_aug, x_var = "Dx_Cancer", fill_var = "Dx_HPV")
ggsave("../results/images/05_proportion_plot.png", plot = proportion_plot, bg = "white")
Saving 7 x 5 in image

Function for making a chi squared test

calculate_chi_squared <- function(data, var1, var2) {
  # Summarize the data so each ID has one row, taking the first occurrence of var1 and var2
  filtered_data <- data %>%
    group_by(ID) %>%
    summarise(
      var1_value = first(!!sym(var1)),
      var2_value = first(!!sym(var2)),
      .groups = "drop"
    ) %>%
    filter(!is.na(var1_value), !is.na(var2_value))  # Remove rows with NA values
  
  # Create the contingency table
  contingency_table <- filtered_data %>%
    count(var1_value, var2_value) %>%
    pivot_wider(names_from = var2_value, values_from = n, values_fill = 0) %>%
    column_to_rownames("var1_value") %>%
    as.matrix()

  # Perform the chi-squared test
  chisq_result <- chisq.test(contingency_table)
  
  return(list(contingency_table = contingency_table, chisq_result = chisq_result))
}

calculate_chi_squared(data_aug, var1 = "Dx_Cancer", var2 = "Dx_HPV")
Warning in chisq.test(contingency_table): Chi-squared approximation may be
incorrect
$contingency_table
      0  1
no  838  2
yes   2 16

$chisq_result

    Pearson's Chi-squared test with Yates' continuity correction

data:  contingency_table
X-squared = 631.83, df = 1, p-value < 2.2e-16
PCA

PCA

Libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
library(cowplot)

Attaching package: 'cowplot'

The following object is masked from 'package:lubridate':

    stamp
library(ggrepel)

Loading

data_aug <- read.csv("../data/03_data_aug.csv")

Making a PCA plot based on patients have cancer or not. Created a new column based on Dx_cancer

cancer_data <- data_aug |> 
  select(-Time_since_first_diagnosis, -Time_since_last_diagnosis, Dx_CIN) |> drop_na() 

pca_fit_cancer <- cancer_data |> 
  select(where(is.numeric)) |>  
  prcomp(scale = TRUE)

scatterplot_pca <- pca_fit_cancer |> 
  augment(cancer_data) |> 
  ggplot(aes(.fittedPC1, .fittedPC2, color = factor(Cancer))) + 
  geom_point(size = 1.5) +
  theme_half_open(12) + background_grid() + scale_color_discrete(labels = c("0" = "Negative", "1" = "Positive")) +
  labs(color = "Cervical Cancer Present") + ggtitle("PCA Plot") + theme(plot.title = element_text(hjust = 0.5))

summary(pca_fit_cancer)
Importance of components:
                          PC1    PC2    PC3     PC4     PC5     PC6     PC7
Standard deviation     1.7460 1.5531 1.4707 1.32729 1.19567 1.11335 1.04722
Proportion of Variance 0.1605 0.1270 0.1138 0.09272 0.07524 0.06524 0.05772
Cumulative Proportion  0.1605 0.2874 0.4012 0.49396 0.56921 0.63445 0.69217
                           PC8     PC9    PC10   PC11    PC12    PC13   PC14
Standard deviation     1.02522 0.97290 0.93986 0.9028 0.83156 0.65611 0.5195
Proportion of Variance 0.05532 0.04982 0.04649 0.0429 0.03639 0.02266 0.0142
Cumulative Proportion  0.74749 0.79730 0.84380 0.8867 0.92309 0.94575 0.9599
                          PC15   PC16   PC17    PC18    PC19
Standard deviation     0.50325 0.4530 0.4088 0.28497 0.23277
Proportion of Variance 0.01333 0.0108 0.0088 0.00427 0.00285
Cumulative Proportion  0.97328 0.9841 0.9929 0.99715 1.00000
scatterplot_pca

#save plot
ggsave("../results/images/06_scatterplot_pca.png", plot = scatterplot_pca, bg = "white")
Saving 7 x 5 in image

The negative patients form a dense cluster in the right corner, whereas it is more spread out to the lower-left corner for the positive ones. Patients with positive cancer diagnosis have more distinct patterns in their values compared to those with no cancer. This separation suggests that features e.g., age, smoking or STD’s might contribute strongly to distinguish cancer-positive patients.

The data in PC coordinates

pca_fit_cancer |> 
  tidy(matrix = "rotation")
# A tibble: 361 × 3
   column    PC   value
   <chr>  <dbl>   <dbl>
 1 ID         1  0.0110
 2 ID         2 -0.0333
 3 ID         3 -0.222 
 4 ID         4  0.235 
 5 ID         5 -0.167 
 6 ID         6 -0.0169
 7 ID         7 -0.218 
 8 ID         8  0.353 
 9 ID         9 -0.213 
10 ID        10  0.653 
# ℹ 351 more rows
# define arrow style for plotting
arrow_style <- arrow(
  angle = 20, ends = "first", type = "closed", length = grid::unit(10, "pt")
)


pc_coordinates <- pca_fit_cancer |>
  tidy(matrix = "rotation") %>%
  pivot_wider(names_from = "PC", names_prefix = "PC", values_from = "value") %>%
  ggplot(aes(PC1, PC2)) +
  geom_segment(xend = 0, yend = 0, arrow = arrow_style) +
  geom_text_repel(
    aes(label = column),
    color = "darkblue"
  ) +
  #xlim(-0.5, .3) + ylim(-.25, 0.5) +
  coord_fixed() +
  theme_minimal_grid(12) +
  ggtitle("The data in the PC coordinates") + 
  theme(plot.title = element_text(hjust = 0.5))


#save plot
ggsave("../results/images/06_pc_coordinates.png", plot = pc_coordinates, bg = "white")
Saving 7 x 5 in image

The arrows shows how the features contribute to the principal components analysis.

The variance explained by each PC

A general rule is to keep the components that explains up to 95% of the variance.

pca_fit_cancer |>
  tidy(matrix = "eigenvalues")
# A tibble: 19 × 4
      PC std.dev percent cumulative
   <dbl>   <dbl>   <dbl>      <dbl>
 1     1   1.75  0.160        0.160
 2     2   1.55  0.127        0.287
 3     3   1.47  0.114        0.401
 4     4   1.33  0.0927       0.494
 5     5   1.20  0.0752       0.569
 6     6   1.11  0.0652       0.634
 7     7   1.05  0.0577       0.692
 8     8   1.03  0.0553       0.747
 9     9   0.973 0.0498       0.797
10    10   0.940 0.0465       0.844
11    11   0.903 0.0429       0.887
12    12   0.832 0.0364       0.923
13    13   0.656 0.0227       0.946
14    14   0.519 0.0142       0.960
15    15   0.503 0.0133       0.973
16    16   0.453 0.0108       0.984
17    17   0.409 0.0088       0.993
18    18   0.285 0.00427      0.997
19    19   0.233 0.00285      1    
pc_variance <- pca_fit_cancer |>
  tidy(matrix = "eigenvalues") |>
  ggplot(aes(PC, percent)) +
  geom_col(fill = "darkblue", alpha = 0.8) +
  scale_x_continuous(breaks = 1:17) +
  scale_y_continuous(
    labels = scales::percent_format(),
    expand = expansion(mult = c(0, 0.01))
  ) +
  theme_minimal_hgrid(12) +
  ggtitle("The variance explained by each PC") + theme(plot.title = element_text(hjust = 0.5))

#save plot
ggsave("../results/images/06_pc_variance.png", plot = pc_variance, bg = "white")
Saving 7 x 5 in image

After PC4 and PC% the variance explained by each PC drops significantly forming an ‘elbow’. After PC10 the other components explain very little variance. These components might not be useful for analyzing or modeling.

01_load

01_load

Loading packages

library(readr)
library(stringr)
library(here)
here() starts at /net/pupil1/home/people/s243634/projects/group_19_project

Loading data

# Base URL and paths
base_url <- "https://archive.ics.uci.edu/static/public/383/cervical+cancer+risk+factors.zip"
raw_data_path <- here("data/_raw/")
processed_data_path <- here("data/")
zip_file <- file.path(raw_data_path, "cervical_cancer_risk_factors.zip")

# Check and create the raw and processed data directories
dir.create(raw_data_path, recursive = TRUE, showWarnings = FALSE)
dir.create(processed_data_path, recursive = TRUE, showWarnings = FALSE)

# Debugging: Confirm directories
message("Raw data path: ", raw_data_path)
Raw data path: /net/pupil1/home/people/s243634/projects/group_19_project/data/_raw/
message("Processed data path: ", processed_data_path)
Processed data path: /net/pupil1/home/people/s243634/projects/group_19_project/data/
# Download ZIP file and extract
download.file(url = base_url, destfile = zip_file, mode = "wb")
if (!file.exists(zip_file)) stop("Failed to download ZIP file: ", zip_file)
unzip(zip_file, exdir = raw_data_path)

# Debugging: Check extracted contents
extracted_files <- list.files(raw_data_path, full.names = TRUE)
message("Extracted files: ", paste(extracted_files, collapse = ", "))
Extracted files: /net/pupil1/home/people/s243634/projects/group_19_project/data/_raw//cervical_cancer_risk_factors.zip, /net/pupil1/home/people/s243634/projects/group_19_project/data/_raw//risk_factors_cervical_cancer.csv
# Identify the CSV file (adjust if the file name differs)
csv_file <- file.path(raw_data_path, "risk_factors_cervical_cancer.csv")

# Check if the CSV file exists and read it
if (file.exists(csv_file)) {
  data_raw <- read.csv(file = csv_file)
  message("Data successfully loaded.")
} else {
  stop("CSV file not found after extraction. Check the ZIP contents.")
}
Data successfully loaded.
# Save the loaded data as 01_data_load.csv in the processed data folder
output_file <- file.path(processed_data_path, "01_data_load.csv")
write_csv(data_raw, output_file)

# Debugging: Confirm the file is saved
if (file.exists(output_file)) {
  message("Processed data saved at: ", output_file)
} else {
  stop("Failed to save the processed data.")
}
Processed data saved at: /net/pupil1/home/people/s243634/projects/group_19_project/data//01_data_load.csv
02_clean

02_clean

Loaing libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading data to clean

data_raw <- read.csv("../data/01_data_load.csv")

Cleaning

Correcting NAs

data_raw <- data_raw |>
  mutate(across(where(is.character), ~na_if(., "?")))

Changing binaries to ‘yes’ and ‘no’

data_raw <- data_raw |>
  mutate(
    Smokes = case_when(
      Smokes == "0.0" ~ "no",
      Smokes == "1.0" ~ "yes"
    ),
    Hormonal.Contraceptives = case_when(
      Hormonal.Contraceptives == "1.0" ~ "yes",
      Hormonal.Contraceptives == "0.0" ~ "no"
    ),
    IUD = case_when(
      IUD == "0.0" ~ "no",
      IUD == "1.0" ~ "yes"
    ),
    Dx.Cancer = case_when(
      Dx.Cancer == "0" ~ "no",
      Dx.Cancer == "1" ~ "yes"
    )
  )

Changing from character to integer

data_raw  <- data_raw |> 
  mutate(Number.of.sexual.partners = as.integer(Number.of.sexual.partners),
         First.sexual.intercourse = as.integer(First.sexual.intercourse),
         Num.of.pregnancies = as.integer(Num.of.pregnancies),
         Smokes..years. = as.numeric(Smokes..years.),
         Smokes..packs.year. = as.numeric(Smokes..packs.year.))

Updating names, removing dots

data_clean <- data_raw |> 
  rename('Smokes.years'= Smokes..years.,
         'Smokes.packs.years' = Smokes..packs.year.) |> 
  rename_with(~ gsub("^STDs\\.\\.", "", .)) |>
  rename_with(~ str_remove(.,"\\.$")) |> #removes the '.' from the last word in columns
  rename_with(~ str_replace_all(., "\\.", "_")) |> 
  rename_with(~ str_replace_all(.x, "__", "_"))

Exporting data

write_csv(x = data_clean,
          file = "../data/02_data_clean.csv")
03_augment

03_augment

Libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading data

data_clean <- read.csv("../data/02_data_clean.csv")

Data augmenting

Creating a Patient ID

data_clean <- data_clean |> 
  mutate(ID = row_number()) |> 
  select(ID, everything())

Creating long data

data_long <- data_clean |>
  pivot_longer(cols = starts_with("STDs_"), 
               names_to = "STD_type", 
               values_to = "has_STD") |>
  mutate(
    STD = ifelse(has_STD == 1, STD_type, NA)  # Keep STD name where 1 is present
  ) |>
  group_by(ID) |>
  mutate(
    # Concatenate STD names for each ID, if none, set "No"
    STD = ifelse(all(is.na(STD)), "No", paste(na.omit(STD), collapse = ", "))
  ) |>
  ungroup() |>
  select(-STD_type, -has_STD) |>
  distinct() |> 
  separate_rows(STD, sep = ",")

Remove STDs from the beginning of each row in column STD. Capitalize each word and remove the underscores.

data_long <- data_long |> 
  mutate(STD = str_remove(STD, "^STDs\\_"),
         STD = str_remove(STD, "^ STDs\\_"),
         STD = str_replace_all(STD, "\\_", " "),
         STD = str_to_title(STD))

Remove column STDs (We have the new STD column).

data_long <- data_long |> 
  select(-STDs, -number) 

Creating new number column that counts number of the same IDs.

data_long <- data_long |> 
  group_by(ID) |> 
  mutate(Number_of_STDs = if_else(all(is.na(STD) | STD == "No"), 0, n_distinct(STD, na.rm = TRUE))) |> 
  ungroup() 

Moving STD and Number of STD columns further ahead.

data_long <- data_long |> 
  select(
    1:12,               
    STD,                
    Number_of_STDs,      
    everything()        
  )

Creating new column for PCA analysis

data_aug <- data_long |>
  mutate(Cancer = if_else(Dx_Cancer == "yes", 1, 0))

Exporting data

write_csv(x = data_aug,
          file = "../data/03_data_aug.csv")
04_describe

04_describe

Loading libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading data

data_aug <- read_csv("../data/03_data_aug.csv")
Rows: 912 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): Smokes, Hormonal_Contraceptives, IUD, STD, Dx_Cancer
dbl (21): ID, Age, Number_of_sexual_partners, First_sexual_intercourse, Num_...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Descriptive statistics

Number of subjects

nr_of_subjects <- data_aug %>%
  summarise(distinct_ID_count = n_distinct(ID))

nr_of_subjects
# A tibble: 1 × 1
  distinct_ID_count
              <int>
1               858

Dimensions

Rows is the number of data points for STDs, so if a subject has 0 or 1 STD they will have 1 row. If a subject has multiple STDs, they will have one row per STD diagnosis.

data_aug |> 
  group_by(ID) |> 
  dim()
[1] 912  26

Age distribution

data_aug |> 
  group_by(ID) |> 
  ggplot(aes(x = Age)) + 
  geom_boxplot()

Number of Cancer positives

data_aug |> 
  group_by(Dx_Cancer) |> 
  count()
# A tibble: 2 × 2
# Groups:   Dx_Cancer [2]
  Dx_Cancer     n
  <chr>     <int>
1 no          894
2 yes          18

Smoking

Number of smokers

data_aug |> 
  group_by(Smokes) |> 
  count()
# A tibble: 3 × 2
# Groups:   Smokes [3]
  Smokes     n
  <chr>  <int>
1 no       759
2 yes      137
3 <NA>      16

Distribution of years of smoking

data_aug |> 
  group_by(ID) |> 
  ggplot(aes(x = Smokes_years)) + 
  geom_boxplot()
Warning: Removed 16 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Sexual health

First sexual intercourse distribution

data_aug |> 
  group_by(ID) |> 
  ggplot(aes(x = First_sexual_intercourse)) + 
  geom_boxplot()
Warning: Removed 7 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Number of pregnancies

data_aug |> 
  group_by(ID) |> 
  ggplot(aes(x = Num_of_pregnancies)) + 
  geom_boxplot()
Warning: Removed 59 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Number of sexual partners

data_aug |> 
  group_by(ID) |> 
  ggplot(aes(x = Number_of_sexual_partners)) + 
  geom_boxplot()
Warning: Removed 26 rows containing non-finite outside the scale range
(`stat_boxplot()`).

05_analysis_1

05_analysis_1

Loading libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(patchwork)

Loading data

data_aug <- read_csv("../data/03_data_aug.csv")
Rows: 912 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): Smokes, Hormonal_Contraceptives, IUD, STD, Dx_Cancer
dbl (21): ID, Age, Number_of_sexual_partners, First_sexual_intercourse, Num_...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Plotting

Numb_sex <- ggplot(data = data_aug, aes(y = Dx_Cancer, x = Number_of_sexual_partners, fill = Dx_Cancer)) +
  geom_boxplot(show.legend = FALSE) +
   labs(y = "Cancer Diagnosis",
        x = "Number of Sexual Partners") +
  theme_bw()


first_sex_int <- ggplot(data = data_aug, aes(y = Dx_Cancer, x = First_sexual_intercourse, fill = Dx_Cancer)) +
  geom_boxplot(show.legend = FALSE) +
   labs(y = "Cancer Diagnosis",
        x = "Age of First Sexual Intercourse") +
  theme_bw()


boxplot_sex_his <- (Numb_sex / first_sex_int) +
  plot_annotation(
    title = "Comparison of Sexual History and Cancer Diagnosis")

boxplot_sex_his
Warning: Removed 26 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Warning: Removed 7 rows containing non-finite outside the scale range
(`stat_boxplot()`).

#save plot
ggsave("../results/images/05_boxplot_sex_his.png", plot = boxplot_sex_his)
Saving 7 x 5 in image
Warning: Removed 26 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Removed 7 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Normalized Counts of STD by Cancer Diagnosis

data_normalized_STD <- data_aug |>
  group_by(Dx_Cancer, STD) |>
  summarize(count = n(), .groups = "drop") |>
  group_by(Dx_Cancer) |>
  mutate(prop = count / sum(count))


data_normalized_STD <- data_normalized_STD |>
  mutate(STD = fct_reorder(STD, count, .desc = TRUE))

barplot_STD <- ggplot(data = data_normalized_STD, aes(x = STD, y = prop, fill = Dx_Cancer)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_grid(~Dx_Cancer) +
  labs(title = "Proportion of STD type by Cancer Diagnosis",
       x = "STD",
       y = "Proportion",
       fill = "Cancer Diagnosis") +
  theme_bw() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),  # Rotate text
    plot.margin = margin(10, 10, 10, 50)  # Increase the left margin
  )

barplot_STD

#save plot
ggsave("../results/images/05_barplot_STD.png", plot = barplot_STD)
Saving 7 x 5 in image

Age and Cancer diagnosis

ggplot(data = data_aug, aes(y = Dx_Cancer, x = Age, fill = Dx_Cancer)) +
  geom_boxplot(show.legend = FALSE) +
  labs(y = "Cancer Diagnosis") +
  theme_bw()

Function for making a barplot

create_proportional_barplot <- function(data, x_var, fill_var, 
                                        x_label = NULL, y_label = "Proportion", 
                                        fill_label = NULL, title = NULL) {
  if (is.null(x_label)) x_label <- x_var
  if (is.null(fill_label)) fill_label <- fill_var
  if (is.null(title)) title <- paste("Proportion of", fill_var, "by", x_var)

  filtered_data <- data %>%
    group_by(ID) %>%
    summarise(
      x_value = first(!!sym(x_var)),
      fill_value = first(!!sym(fill_var)),
      .groups = "drop"
    ) %>%
    filter(!is.na(x_value), !is.na(fill_value)) %>%
    mutate(fill_value = factor(fill_value))

  # Ensure color palette matches the levels of 'fill_value'
  color_palette <- setNames(
    c("lightblue", "darkred", "green", "orange")[1:length(levels(filtered_data$fill_value))],
    levels(filtered_data$fill_value)
  )

  ggplot(filtered_data, aes(x = x_value, fill = fill_value)) +
    geom_bar(position = "fill") +
    labs(x = x_label, y = y_label, fill = fill_label, title = title) +
    scale_fill_manual(values = color_palette) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 0, hjust = 0.5))
}

proportion_plot <- create_proportional_barplot(data_aug, x_var = "Dx_Cancer", fill_var = "Dx_HPV")
ggsave("../results/images/05_proportion_plot.png", plot = proportion_plot, bg = "white")
Saving 7 x 5 in image

Function for making a chi squared test

calculate_chi_squared <- function(data, var1, var2) {
  # Summarize the data so each ID has one row, taking the first occurrence of var1 and var2
  filtered_data <- data %>%
    group_by(ID) %>%
    summarise(
      var1_value = first(!!sym(var1)),
      var2_value = first(!!sym(var2)),
      .groups = "drop"
    ) %>%
    filter(!is.na(var1_value), !is.na(var2_value))  # Remove rows with NA values
  
  # Create the contingency table
  contingency_table <- filtered_data %>%
    count(var1_value, var2_value) %>%
    pivot_wider(names_from = var2_value, values_from = n, values_fill = 0) %>%
    column_to_rownames("var1_value") %>%
    as.matrix()

  # Perform the chi-squared test
  chisq_result <- chisq.test(contingency_table)
  
  return(list(contingency_table = contingency_table, chisq_result = chisq_result))
}

calculate_chi_squared(data_aug, var1 = "Dx_Cancer", var2 = "Dx_HPV")
Warning in chisq.test(contingency_table): Chi-squared approximation may be
incorrect
$contingency_table
      0  1
no  838  2
yes   2 16

$chisq_result

    Pearson's Chi-squared test with Yates' continuity correction

data:  contingency_table
X-squared = 631.83, df = 1, p-value < 2.2e-16
PCA

PCA

Libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
library(cowplot)

Attaching package: 'cowplot'

The following object is masked from 'package:lubridate':

    stamp
library(ggrepel)

Loading

data_aug <- read.csv("../data/03_data_aug.csv")

Making a PCA plot based on patients have cancer or not. Created a new column based on Dx_cancer

cancer_data <- data_aug |> 
  select(-Time_since_first_diagnosis, -Time_since_last_diagnosis, Dx_CIN) |> drop_na() 

pca_fit_cancer <- cancer_data |> 
  select(where(is.numeric)) |>  
  prcomp(scale = TRUE)

scatterplot_pca <- pca_fit_cancer |> 
  augment(cancer_data) |> 
  ggplot(aes(.fittedPC1, .fittedPC2, color = factor(Cancer))) + 
  geom_point(size = 1.5) +
  theme_half_open(12) + background_grid() + scale_color_discrete(labels = c("0" = "Negative", "1" = "Positive")) +
  labs(color = "Cervical Cancer Present") + ggtitle("PCA Plot") + theme(plot.title = element_text(hjust = 0.5))

summary(pca_fit_cancer)
Importance of components:
                          PC1    PC2    PC3     PC4     PC5     PC6     PC7
Standard deviation     1.7460 1.5531 1.4707 1.32729 1.19567 1.11335 1.04722
Proportion of Variance 0.1605 0.1270 0.1138 0.09272 0.07524 0.06524 0.05772
Cumulative Proportion  0.1605 0.2874 0.4012 0.49396 0.56921 0.63445 0.69217
                           PC8     PC9    PC10   PC11    PC12    PC13   PC14
Standard deviation     1.02522 0.97290 0.93986 0.9028 0.83156 0.65611 0.5195
Proportion of Variance 0.05532 0.04982 0.04649 0.0429 0.03639 0.02266 0.0142
Cumulative Proportion  0.74749 0.79730 0.84380 0.8867 0.92309 0.94575 0.9599
                          PC15   PC16   PC17    PC18    PC19
Standard deviation     0.50325 0.4530 0.4088 0.28497 0.23277
Proportion of Variance 0.01333 0.0108 0.0088 0.00427 0.00285
Cumulative Proportion  0.97328 0.9841 0.9929 0.99715 1.00000
scatterplot_pca

#save plot
ggsave("../results/images/06_scatterplot_pca.png", plot = scatterplot_pca, bg = "white")
Saving 7 x 5 in image

The negative patients form a dense cluster in the right corner, whereas it is more spread out to the lower-left corner for the positive ones. Patients with positive cancer diagnosis have more distinct patterns in their values compared to those with no cancer. This separation suggests that features e.g., age, smoking or STD’s might contribute strongly to distinguish cancer-positive patients.

The data in PC coordinates

pca_fit_cancer |> 
  tidy(matrix = "rotation")
# A tibble: 361 × 3
   column    PC   value
   <chr>  <dbl>   <dbl>
 1 ID         1  0.0110
 2 ID         2 -0.0333
 3 ID         3 -0.222 
 4 ID         4  0.235 
 5 ID         5 -0.167 
 6 ID         6 -0.0169
 7 ID         7 -0.218 
 8 ID         8  0.353 
 9 ID         9 -0.213 
10 ID        10  0.653 
# ℹ 351 more rows
# define arrow style for plotting
arrow_style <- arrow(
  angle = 20, ends = "first", type = "closed", length = grid::unit(10, "pt")
)


pc_coordinates <- pca_fit_cancer |>
  tidy(matrix = "rotation") %>%
  pivot_wider(names_from = "PC", names_prefix = "PC", values_from = "value") %>%
  ggplot(aes(PC1, PC2)) +
  geom_segment(xend = 0, yend = 0, arrow = arrow_style) +
  geom_text_repel(
    aes(label = column),
    color = "darkblue"
  ) +
  #xlim(-0.5, .3) + ylim(-.25, 0.5) +
  coord_fixed() +
  theme_minimal_grid(12) +
  ggtitle("The data in the PC coordinates") + 
  theme(plot.title = element_text(hjust = 0.5))


#save plot
ggsave("../results/images/06_pc_coordinates.png", plot = pc_coordinates, bg = "white")
Saving 7 x 5 in image

The arrows shows how the features contribute to the principal components analysis.

The variance explained by each PC

A general rule is to keep the components that explains up to 95% of the variance.

pca_fit_cancer |>
  tidy(matrix = "eigenvalues")
# A tibble: 19 × 4
      PC std.dev percent cumulative
   <dbl>   <dbl>   <dbl>      <dbl>
 1     1   1.75  0.160        0.160
 2     2   1.55  0.127        0.287
 3     3   1.47  0.114        0.401
 4     4   1.33  0.0927       0.494
 5     5   1.20  0.0752       0.569
 6     6   1.11  0.0652       0.634
 7     7   1.05  0.0577       0.692
 8     8   1.03  0.0553       0.747
 9     9   0.973 0.0498       0.797
10    10   0.940 0.0465       0.844
11    11   0.903 0.0429       0.887
12    12   0.832 0.0364       0.923
13    13   0.656 0.0227       0.946
14    14   0.519 0.0142       0.960
15    15   0.503 0.0133       0.973
16    16   0.453 0.0108       0.984
17    17   0.409 0.0088       0.993
18    18   0.285 0.00427      0.997
19    19   0.233 0.00285      1    
pc_variance <- pca_fit_cancer |>
  tidy(matrix = "eigenvalues") |>
  ggplot(aes(PC, percent)) +
  geom_col(fill = "darkblue", alpha = 0.8) +
  scale_x_continuous(breaks = 1:17) +
  scale_y_continuous(
    labels = scales::percent_format(),
    expand = expansion(mult = c(0, 0.01))
  ) +
  theme_minimal_hgrid(12) +
  ggtitle("The variance explained by each PC") + theme(plot.title = element_text(hjust = 0.5))

#save plot
ggsave("../results/images/06_pc_variance.png", plot = pc_variance, bg = "white")
Saving 7 x 5 in image

After PC4 and PC% the variance explained by each PC drops significantly forming an ‘elbow’. After PC10 the other components explain very little variance. These components might not be useful for analyzing or modeling.

01_load

01_load

Loading packages

library(readr)
library(stringr)
library(here)
here() starts at /home/people/s243903/projects/project_r4bd

Loading data

# Base URL and paths
base_url <- "https://archive.ics.uci.edu/static/public/383/cervical+cancer+risk+factors.zip"
raw_data_path <- here("data/_raw/")
processed_data_path <- here("data/")
zip_file <- file.path(raw_data_path, "cervical_cancer_risk_factors.zip")

# Check and create the raw and processed data directories
dir.create(raw_data_path, recursive = TRUE, showWarnings = FALSE)
dir.create(processed_data_path, recursive = TRUE, showWarnings = FALSE)

# Debugging: Confirm directories
message("Raw data path: ", raw_data_path)
Raw data path: /home/people/s243903/projects/project_r4bd/data/_raw/
message("Processed data path: ", processed_data_path)
Processed data path: /home/people/s243903/projects/project_r4bd/data/
# Download ZIP file and extract
download.file(url = base_url, destfile = zip_file, mode = "wb")
if (!file.exists(zip_file)) stop("Failed to download ZIP file: ", zip_file)
unzip(zip_file, exdir = raw_data_path)

# Debugging: Check extracted contents
extracted_files <- list.files(raw_data_path, full.names = TRUE)
message("Extracted files: ", paste(extracted_files, collapse = ", "))
Extracted files: /home/people/s243903/projects/project_r4bd/data/_raw//cervical_cancer_risk_factors.zip, /home/people/s243903/projects/project_r4bd/data/_raw//risk_factors_cervical_cancer.csv
# Identify the CSV file (adjust if the file name differs)
csv_file <- file.path(raw_data_path, "risk_factors_cervical_cancer.csv")

# Check if the CSV file exists and read it
if (file.exists(csv_file)) {
  data_raw <- read.csv(file = csv_file)
  message("Data successfully loaded.")
} else {
  stop("CSV file not found after extraction. Check the ZIP contents.")
}
Data successfully loaded.
# Save the loaded data as 01_data_load.csv in the processed data folder
output_file <- file.path(processed_data_path, "01_data_load.csv")
write_csv(data_raw, output_file)

# Debugging: Confirm the file is saved
if (file.exists(output_file)) {
  message("Processed data saved at: ", output_file)
} else {
  stop("Failed to save the processed data.")
}
Processed data saved at: /home/people/s243903/projects/project_r4bd/data//01_data_load.csv
02_clean

02_clean

Loaing libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading data to clean

data_raw <- read.csv("../data/01_data_load.csv")

Cleaning

Correcting NAs

data_raw <- data_raw |>
  mutate(across(where(is.character), ~na_if(., "?")))

Changing binaries to ‘yes’ and ‘no’

data_raw <- data_raw |>
  mutate(
    Smokes = case_when(
      Smokes == "0.0" ~ "no",
      Smokes == "1.0" ~ "yes"
    ),
    Hormonal.Contraceptives = case_when(
      Hormonal.Contraceptives == "1.0" ~ "yes",
      Hormonal.Contraceptives == "0.0" ~ "no"
    ),
    IUD = case_when(
      IUD == "0.0" ~ "no",
      IUD == "1.0" ~ "yes"
    ),
    Dx.Cancer = case_when(
      Dx.Cancer == "0" ~ "no",
      Dx.Cancer == "1" ~ "yes"
    )
  )

Changing from character to integer

data_raw  <- data_raw |> 
  mutate(Number.of.sexual.partners = as.integer(Number.of.sexual.partners),
         First.sexual.intercourse = as.integer(First.sexual.intercourse),
         Num.of.pregnancies = as.integer(Num.of.pregnancies),
         Smokes..years. = as.numeric(Smokes..years.),
         Smokes..packs.year. = as.numeric(Smokes..packs.year.))

Updating names, removing dots

data_clean <- data_raw |> 
  rename('Smokes.years'= Smokes..years.,
         'Smokes.packs.years' = Smokes..packs.year.) |> 
  rename_with(~ gsub("^STDs\\.\\.", "", .)) |>
  rename_with(~ str_remove(.,"\\.$")) |> #removes the '.' from the last word in columns
  rename_with(~ str_replace_all(., "\\.", "_")) |> 
  rename_with(~ str_replace_all(.x, "__", "_"))

Exporting data

write_csv(x = data_clean,
          file = "../data/02_data_clean.csv")
03_augment

03_augment

Libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading data

data_clean <- read.csv("../data/02_data_clean.csv")

Data augmenting

Creating a Patient ID

data_clean <- data_clean |> 
  mutate(ID = row_number()) |> 
  select(ID, everything())

Creating long data

data_long <- data_clean |>
  pivot_longer(cols = starts_with("STDs_"), 
               names_to = "STD_type", 
               values_to = "has_STD") |>
  mutate(
    STD = ifelse(has_STD == 1, STD_type, NA)  # Keep STD name where 1 is present
  ) |>
  group_by(ID) |>
  mutate(
    # Concatenate STD names for each ID, if none, set "No"
    STD = ifelse(all(is.na(STD)), "No", paste(na.omit(STD), collapse = ", "))
  ) |>
  ungroup() |>
  select(-STD_type, -has_STD) |>
  distinct() |> 
  separate_rows(STD, sep = ",")

Remove STDs from the beginning of each row in column STD. Capitalize each word and remove the underscores.

data_long <- data_long |> 
  mutate(STD = str_remove(STD, "^STDs\\_"),
         STD = str_remove(STD, "^ STDs\\_"),
         STD = str_replace_all(STD, "\\_", " "),
         STD = str_to_title(STD))

Remove column STDs (We have the new STD column).

data_long <- data_long |> 
  select(-STDs, -number) 

Creating new number column that counts number of the same IDs.

data_long <- data_long |> 
  group_by(ID) |> 
  mutate(Number_of_STDs = if_else(all(is.na(STD) | STD == "No"), 0, n_distinct(STD, na.rm = TRUE))) |> 
  ungroup() 

Moving STD and Number of STD columns further ahead.

data_long <- data_long |> 
  select(
    1:12,               
    STD,                
    Number_of_STDs,      
    everything()        
  )

Creating new column for PCA analysis

data_aug <- data_long |>
  mutate(Cancer = if_else(Dx_Cancer == "yes", 1, 0))

Exporting data

write_csv(x = data_aug,
          file = "../data/03_data_aug.csv")
04_describe

04_describe

Loading libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading data

data_aug <- read_csv("../data/03_data_aug.csv")
Rows: 912 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): Smokes, Hormonal_Contraceptives, IUD, STD, Dx_Cancer
dbl (21): ID, Age, Number_of_sexual_partners, First_sexual_intercourse, Num_...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Descriptive statistics

Number of subjects

nr_of_subjects <- data_aug %>%
  summarise(distinct_ID_count = n_distinct(ID))

nr_of_subjects
# A tibble: 1 × 1
  distinct_ID_count
              <int>
1               858

Dimensions

Rows is the number of data points for STDs, so if a subject has 0 or 1 STD they will have 1 row. If a subject has multiple STDs, they will have one row per STD diagnosis.

data_aug |> 
  group_by(ID) |> 
  dim()
[1] 912  26

Age distribution

data_aug |> 
  group_by(ID) |> 
  ggplot(aes(x = Age)) + 
  geom_boxplot()

Number of Cancer positives

data_aug |> 
  group_by(Dx_Cancer) |> 
  count()
# A tibble: 2 × 2
# Groups:   Dx_Cancer [2]
  Dx_Cancer     n
  <chr>     <int>
1 no          894
2 yes          18

Smoking

Number of smokers

data_aug |> 
  group_by(Smokes) |> 
  count()
# A tibble: 3 × 2
# Groups:   Smokes [3]
  Smokes     n
  <chr>  <int>
1 no       759
2 yes      137
3 <NA>      16

Distribution of years of smoking

data_aug |> 
  group_by(ID) |> 
  ggplot(aes(x = Smokes_years)) + 
  geom_boxplot()
Warning: Removed 16 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Sexual health

First sexual intercourse distribution

data_aug |> 
  group_by(ID) |> 
  ggplot(aes(x = First_sexual_intercourse)) + 
  geom_boxplot()
Warning: Removed 7 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Number of pregnancies

data_aug |> 
  group_by(ID) |> 
  ggplot(aes(x = Num_of_pregnancies)) + 
  geom_boxplot()
Warning: Removed 59 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Number of sexual partners

data_aug |> 
  group_by(ID) |> 
  ggplot(aes(x = Number_of_sexual_partners)) + 
  geom_boxplot()
Warning: Removed 26 rows containing non-finite outside the scale range
(`stat_boxplot()`).

05_analysis_1

05_analysis_1

Loading libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(patchwork)

Loading data

data_aug <- read_csv("../data/03_data_aug.csv")
Rows: 912 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): Smokes, Hormonal_Contraceptives, IUD, STD, Dx_Cancer
dbl (21): ID, Age, Number_of_sexual_partners, First_sexual_intercourse, Num_...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Plotting

Numb_sex <- ggplot(data = data_aug, aes(y = Dx_Cancer, x = Number_of_sexual_partners, fill = Dx_Cancer)) +
  geom_boxplot(show.legend = FALSE) +
   labs(y = "Cancer Diagnosis",
        x = "Number of Sexual Partners") +
  theme_bw()


first_sex_int <- ggplot(data = data_aug, aes(y = Dx_Cancer, x = First_sexual_intercourse, fill = Dx_Cancer)) +
  geom_boxplot(show.legend = FALSE) +
   labs(y = "Cancer Diagnosis",
        x = "Age of First Sexual Intercourse") +
  theme_bw()


boxplot_sex_his <- (Numb_sex / first_sex_int) +
  plot_annotation(
    title = "Comparison of Sexual History and Cancer Diagnosis")

boxplot_sex_his
Warning: Removed 26 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Warning: Removed 7 rows containing non-finite outside the scale range
(`stat_boxplot()`).

#save plot
ggsave("../results/images/05_boxplot_sex_his.png", plot = boxplot_sex_his)
Saving 7 x 5 in image
Warning: Removed 26 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Removed 7 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Normalized Counts of STD by Cancer Diagnosis

data_normalized_STD <- data_aug |>
  group_by(Dx_Cancer, STD) |>
  summarize(count = n(), .groups = "drop") |>
  group_by(Dx_Cancer) |>
  mutate(prop = count / sum(count))


data_normalized_STD <- data_normalized_STD |>
  mutate(STD = fct_reorder(STD, count, .desc = TRUE))

barplot_STD <- ggplot(data = data_normalized_STD, aes(x = STD, y = prop, fill = Dx_Cancer)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_grid(~Dx_Cancer) +
  labs(title = "Proportion of STD type by Cancer Diagnosis",
       x = "STD",
       y = "Proportion",
       fill = "Cancer Diagnosis") +
  theme_bw() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),  # Rotate text
    plot.margin = margin(10, 10, 10, 50)  # Increase the left margin
  )

barplot_STD

#save plot
ggsave("../results/images/05_barplot_STD.png", plot = barplot_STD)
Saving 7 x 5 in image

Age and Cancer diagnosis

ggplot(data = data_aug, aes(y = Dx_Cancer, x = Age, fill = Dx_Cancer)) +
  geom_boxplot(show.legend = FALSE) +
  labs(y = "Cancer Diagnosis") +
  theme_bw()

Function for making a barplot

create_proportional_barplot <- function(data, x_var, fill_var, 
                                        x_label = NULL, y_label = "Proportion", 
                                        fill_label = NULL, title = NULL) {
  if (is.null(x_label)) x_label <- x_var
  if (is.null(fill_label)) fill_label <- fill_var
  if (is.null(title)) title <- paste("Proportion of", fill_var, "by", x_var)

  filtered_data <- data %>%
    group_by(ID) %>%
    summarise(
      x_value = first(!!sym(x_var)),
      fill_value = first(!!sym(fill_var)),
      .groups = "drop"
    ) %>%
    filter(!is.na(x_value), !is.na(fill_value)) %>%
    mutate(fill_value = factor(fill_value))

  # Ensure color palette matches the levels of 'fill_value'
  color_palette <- setNames(
    c("lightblue", "darkred", "green", "orange")[1:length(levels(filtered_data$fill_value))],
    levels(filtered_data$fill_value)
  )

  ggplot(filtered_data, aes(x = x_value, fill = fill_value)) +
    geom_bar(position = "fill") +
    labs(x = x_label, y = y_label, fill = fill_label, title = title) +
    scale_fill_manual(values = color_palette) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 0, hjust = 0.5))
}

proportion_plot <- create_proportional_barplot(data_aug, x_var = "Dx_Cancer", fill_var = "Dx_HPV")
ggsave("../results/images/05_proportion_plot.png", plot = proportion_plot, bg = "white")
Saving 7 x 5 in image

Function for making a chi squared test

calculate_chi_squared <- function(data, var1, var2) {
  # Summarize the data so each ID has one row, taking the first occurrence of var1 and var2
  filtered_data <- data %>%
    group_by(ID) %>%
    summarise(
      var1_value = first(!!sym(var1)),
      var2_value = first(!!sym(var2)),
      .groups = "drop"
    ) %>%
    filter(!is.na(var1_value), !is.na(var2_value))  # Remove rows with NA values
  
  # Create the contingency table
  contingency_table <- filtered_data %>%
    count(var1_value, var2_value) %>%
    pivot_wider(names_from = var2_value, values_from = n, values_fill = 0) %>%
    column_to_rownames("var1_value") %>%
    as.matrix()

  # Perform the chi-squared test
  chisq_result <- chisq.test(contingency_table)
  
  return(list(contingency_table = contingency_table, chisq_result = chisq_result))
}

calculate_chi_squared(data_aug, var1 = "Dx_Cancer", var2 = "Dx_HPV")
Warning in chisq.test(contingency_table): Chi-squared approximation may be
incorrect
$contingency_table
      0  1
no  838  2
yes   2 16

$chisq_result

    Pearson's Chi-squared test with Yates' continuity correction

data:  contingency_table
X-squared = 631.83, df = 1, p-value < 2.2e-16
PCA

PCA

Libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
library(cowplot)

Attaching package: 'cowplot'

The following object is masked from 'package:lubridate':

    stamp
library(ggrepel)

Loading

data_aug <- read.csv("../data/03_data_aug.csv")

Making a PCA plot based on patients have cancer or not. Created a new column based on Dx_cancer

cancer_data <- data_aug |> 
  select(-Time_since_first_diagnosis, -Time_since_last_diagnosis, Dx_CIN) |> drop_na() 

pca_fit_cancer <- cancer_data |> 
  select(where(is.numeric)) |>  
  prcomp(scale = TRUE)

scatterplot_pca <- pca_fit_cancer |> 
  augment(cancer_data) |> 
  ggplot(aes(.fittedPC1, .fittedPC2, color = factor(Cancer))) + 
  geom_point(size = 1.5) +
  theme_half_open(12) + background_grid() + scale_color_discrete(labels = c("0" = "Negative", "1" = "Positive")) +
  labs(color = "Cervical Cancer Present") + ggtitle("PCA Plot") + theme(plot.title = element_text(hjust = 0.5))

summary(pca_fit_cancer)
Importance of components:
                          PC1    PC2    PC3     PC4     PC5     PC6     PC7
Standard deviation     1.7460 1.5531 1.4707 1.32729 1.19567 1.11335 1.04722
Proportion of Variance 0.1605 0.1270 0.1138 0.09272 0.07524 0.06524 0.05772
Cumulative Proportion  0.1605 0.2874 0.4012 0.49396 0.56921 0.63445 0.69217
                           PC8     PC9    PC10   PC11    PC12    PC13   PC14
Standard deviation     1.02522 0.97290 0.93986 0.9028 0.83156 0.65611 0.5195
Proportion of Variance 0.05532 0.04982 0.04649 0.0429 0.03639 0.02266 0.0142
Cumulative Proportion  0.74749 0.79730 0.84380 0.8867 0.92309 0.94575 0.9599
                          PC15   PC16   PC17    PC18    PC19
Standard deviation     0.50325 0.4530 0.4088 0.28497 0.23277
Proportion of Variance 0.01333 0.0108 0.0088 0.00427 0.00285
Cumulative Proportion  0.97328 0.9841 0.9929 0.99715 1.00000
scatterplot_pca

#save plot
ggsave("../results/images/06_scatterplot_pca.png", plot = scatterplot_pca, bg = "white")
Saving 7 x 5 in image

The negative patients form a dense cluster in the right corner, whereas it is more spread out to the lower-left corner for the positive ones. Patients with positive cancer diagnosis have more distinct patterns in their values compared to those with no cancer. This separation suggests that features e.g., age, smoking or STD’s might contribute strongly to distinguish cancer-positive patients.

The data in PC coordinates

pca_fit_cancer |> 
  tidy(matrix = "rotation")
# A tibble: 361 × 3
   column    PC   value
   <chr>  <dbl>   <dbl>
 1 ID         1  0.0110
 2 ID         2 -0.0333
 3 ID         3 -0.222 
 4 ID         4  0.235 
 5 ID         5 -0.167 
 6 ID         6 -0.0169
 7 ID         7 -0.218 
 8 ID         8  0.353 
 9 ID         9 -0.213 
10 ID        10  0.653 
# ℹ 351 more rows
# define arrow style for plotting
arrow_style <- arrow(
  angle = 20, ends = "first", type = "closed", length = grid::unit(10, "pt")
)


pc_coordinates <- pca_fit_cancer |>
  tidy(matrix = "rotation") %>%
  pivot_wider(names_from = "PC", names_prefix = "PC", values_from = "value") %>%
  ggplot(aes(PC1, PC2)) +
  geom_segment(xend = 0, yend = 0, arrow = arrow_style) +
  geom_text_repel(
    aes(label = column),
    color = "darkblue"
  ) +
  #xlim(-0.5, .3) + ylim(-.25, 0.5) +
  coord_fixed() +
  theme_minimal_grid(12) +
  ggtitle("The data in the PC coordinates") + 
  theme(plot.title = element_text(hjust = 0.5))


#save plot
ggsave("../results/images/06_pc_coordinates.png", plot = pc_coordinates, bg = "white")
Saving 7 x 5 in image

The arrows shows how the features contribute to the principal components analysis.

The variance explained by each PC

A general rule is to keep the components that explains up to 95% of the variance.

pca_fit_cancer |>
  tidy(matrix = "eigenvalues")
# A tibble: 19 × 4
      PC std.dev percent cumulative
   <dbl>   <dbl>   <dbl>      <dbl>
 1     1   1.75  0.160        0.160
 2     2   1.55  0.127        0.287
 3     3   1.47  0.114        0.401
 4     4   1.33  0.0927       0.494
 5     5   1.20  0.0752       0.569
 6     6   1.11  0.0652       0.634
 7     7   1.05  0.0577       0.692
 8     8   1.03  0.0553       0.747
 9     9   0.973 0.0498       0.797
10    10   0.940 0.0465       0.844
11    11   0.903 0.0429       0.887
12    12   0.832 0.0364       0.923
13    13   0.656 0.0227       0.946
14    14   0.519 0.0142       0.960
15    15   0.503 0.0133       0.973
16    16   0.453 0.0108       0.984
17    17   0.409 0.0088       0.993
18    18   0.285 0.00427      0.997
19    19   0.233 0.00285      1    
pc_variance <- pca_fit_cancer |>
  tidy(matrix = "eigenvalues") |>
  ggplot(aes(PC, percent)) +
  geom_col(fill = "darkblue", alpha = 0.8) +
  scale_x_continuous(breaks = 1:17) +
  scale_y_continuous(
    labels = scales::percent_format(),
    expand = expansion(mult = c(0, 0.01))
  ) +
  theme_minimal_hgrid(12) +
  ggtitle("The variance explained by each PC") + theme(plot.title = element_text(hjust = 0.5))

#save plot
ggsave("../results/images/06_pc_variance.png", plot = pc_variance, bg = "white")
Saving 7 x 5 in image

After PC4 and PC% the variance explained by each PC drops significantly forming an ‘elbow’. After PC10 the other components explain very little variance. These components might not be useful for analyzing or modeling.